Re: Pluggable cumulative statistics
Hi,

On Tue, Jul 09, 2024 at 10:45:05AM +0900, Michael Paquier wrote:
> On Mon, Jul 08, 2024 at 07:22:32AM +0000, Bertrand Drouvot wrote:
> > Yeah, what I meant to say is that one could think for example that's the
> > PgStatShared_Archiver size while in fact it's the PgStat_ArchiverStats size.
> > I think it's more confusing when writing the stats. Here we are manipulating
> > "snapshot" data and "snapshot" offsets. It was not that confusing when
> > reading as we are manipulating "shmem" and "shared" offsets.
> >
> > As I said, the code is fully correct, that's just the wording here that
> > sounds weird to me in the "snapshot" context.
>
> After sleeping on it, I can see your point. If we were to do the
> (shared_data_len -> stats_data_len) switch, could it make sense to
> rename shared_data_off to stats_data_off to have a better symmetry?
> This one is the offset of the stats data in a shmem entry, so perhaps
> shared_data_off is OK, but it feels a bit inconsistent as well.

Agree that if we were to rename one of them then the second one should be
renamed too. I gave it a second thought, and I think that it is the "data"
part that led to the confusion (as too generic). What about:

shared_data_len -> shared_stats_len
shared_data_off -> shared_stats_off

That looks OK to me even in the snapshot context (shared is fine after all
because that's where the stats come from).

Attached is a patch proposal doing so.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 390fdbc5b3fb1f25df809d0541507b886f1b15ec Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Tue, 9 Jul 2024 04:52:04 +
Subject: [PATCH v1] Renaming two fields in PgStat_KindInfo

Renaming shared_data_off to shared_stats_off and shared_data_len to
shared_stats_len.

"data" seems too generic and started to be confusing since b68b29bc8f makes
use of the "_len" one in the snapshot context.
Using "stats" instead as that's what those fields are linked to.
---
 src/backend/utils/activity/pgstat.c | 56 ++---
 src/include/utils/pgstat_internal.h | 12 +++
 2 files changed, 34 insertions(+), 34 deletions(-)
  84.2% src/backend/utils/activity/
  15.7% src/include/utils/

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 65b937a85f..bb777423fb 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -274,8 +274,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.accessed_across_databases = true,
 		.shared_size = sizeof(PgStatShared_Database),
-		.shared_data_off = offsetof(PgStatShared_Database, stats),
-		.shared_data_len = sizeof(((PgStatShared_Database *) 0)->stats),
+		.shared_stats_off = offsetof(PgStatShared_Database, stats),
+		.shared_stats_len = sizeof(((PgStatShared_Database *) 0)->stats),
 		.pending_size = sizeof(PgStat_StatDBEntry),
 		.flush_pending_cb = pgstat_database_flush_cb,
@@ -288,8 +288,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.fixed_amount = false,
 		.shared_size = sizeof(PgStatShared_Relation),
-		.shared_data_off = offsetof(PgStatShared_Relation, stats),
-		.shared_data_len = sizeof(((PgStatShared_Relation *) 0)->stats),
+		.shared_stats_off = offsetof(PgStatShared_Relation, stats),
+		.shared_stats_len = sizeof(((PgStatShared_Relation *) 0)->stats),
 		.pending_size = sizeof(PgStat_TableStatus),
 		.flush_pending_cb = pgstat_relation_flush_cb,
@@ -302,8 +302,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.fixed_amount = false,
 		.shared_size = sizeof(PgStatShared_Function),
-		.shared_data_off = offsetof(PgStatShared_Function, stats),
-		.shared_data_len = sizeof(((PgStatShared_Function *) 0)->stats),
+		.shared_stats_off = offsetof(PgStatShared_Function, stats),
+		.shared_stats_len = sizeof(((PgStatShared_Function *) 0)->stats),
 		.pending_size = sizeof(PgStat_FunctionCounts),
 		.flush_pending_cb = pgstat_function_flush_cb,
@@ -317,8 +317,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.accessed_across_databases = true,
 		.shared_size = sizeof(PgStatShared_ReplSlot),
-		.shared_data_off = offsetof(PgStatShared_ReplSlot, stats),
-		.shared_data_len = sizeof(((PgStatShared_ReplSlot *) 0)->stats),
+		.shared_stats_off = offsetof(PgStatShared_ReplSlot, stats),
+		.shared_stats_len = sizeof(((PgStatShared_ReplSlot *) 0)->stats),
 		.reset_timestamp_cb = pgstat_replslot_reset_timestamp_cb,
 		.to_serialized_name = pgstat_replslot_to_serialized_name_cb,
@@ -333,8 +333,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.accessed_across_databases = true,
 		.
Re: Pluggable cumulative statistics
Hi,

On Mon, Jul 08, 2024 at 07:22:32AM +, Bertrand Drouvot wrote:
> Except the above (which is just a Nit), 0001 LGTM.
> Looking at 0002:

It looks pretty straightforward, just one comment:

+		ptr = ((char *) ctl) + kind_info->shared_ctl_off;
+		kind_info->init_shmem_cb((void *) ptr);

I don't think we need to cast ptr to void when calling init_shmem_cb().
Looking at some examples in the code, it does not look like we cast the
argument to void when a function has (void *) as parameter (also there are
examples in 0003 where it's not done, see next comments for 0003). So I
think removing the cast here would be more consistent.

Looking at 0003:

It looks pretty straightforward. Also for example, here:

+			fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
+			write_chunk_s(fpout, );
			write_chunk(fpout, ptr, info->shared_data_len);

ptr is not cast to void when calling write_chunk() while its second
parameter is a "void *".

+			ptr = ((char *) shmem) + info->shared_ctl_off +
+				info->shared_data_off;
+
+			if (!read_chunk(fpin, ptr,

Same here for read_chunk(). I think that's perfectly fine and that we should
do the same in 0002 when calling init_shmem_cb() for consistency.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Pluggable cumulative statistics
Hi,

On Mon, Jul 08, 2024 at 03:49:34PM +0900, Michael Paquier wrote:
> On Mon, Jul 08, 2024 at 06:39:56AM +0000, Bertrand Drouvot wrote:
> > + for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
> > + {
> > + char *ptr;
> > + const PgStat_KindInfo *info = pgstat_get_kind_info(kind);
> >
> > I wonder if we could avoid going through stats that are not fixed ones.
> > What about doing something like?
> > Same would apply for the read part added in 9004abf6206e.
>
> This becomes more relevant when the custom stats are added, as this
> performs a scan across the full range of IDs supported. So this
> choice is here for consistency, and to ease the pluggability.

Gotcha.

> > 2 ===
> >
> > + pgstat_build_snapshot_fixed(kind);
> > + ptr = ((char *) ) + info->snapshot_ctl_off;
> > + write_chunk(fpout, ptr, info->shared_data_len);
> >
> > I think that using "shared_data_len" is confusing here (was not the case
> > in the context of 9004abf6206e). I mean it is perfectly correct but the
> > wording "shared" looks weird to me when being used here. What it really
> > is, is the size of the stats. What about renaming shared_data_len with
> > stats_data_len?
>
> It is the stats data associated to a shared entry. I think that's OK,
> but perhaps I'm just used to it as I've been staring at this area for
> days.

Yeah, what I meant to say is that one could think for example that's the
PgStatShared_Archiver size while in fact it's the PgStat_ArchiverStats size.
I think it's more confusing when writing the stats. Here we are manipulating
"snapshot" data and "snapshot" offsets. It was not that confusing when
reading as we are manipulating "shmem" and "shared" offsets.

As I said, the code is fully correct, that's just the wording here that
sounds weird to me in the "snapshot" context.

Except the above (which is just a Nit), 0001 LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Pluggable cumulative statistics
Hi,

On Mon, Jul 08, 2024 at 02:30:23PM +0900, Michael Paquier wrote:
> On Fri, Jul 05, 2024 at 09:35:19AM +0900, Michael Paquier wrote:
> > On Thu, Jul 04, 2024 at 11:30:17AM +, Bertrand Drouvot wrote:
> >> On Wed, Jul 03, 2024 at 06:47:15PM +0900, Michael Paquier wrote:
> >>> - PgStat_Snapshot holds an array of void* also indexed by
> >>> PGSTAT_NUM_KINDS, pointing to the fixed stats stored in the
> >>> snapshots.
> >>
> >> Same, that's just a 96 bytes overhead (8 * PGSTAT_NUM_KINDS) as compared
> >> to now.
> >
> > Still Andres does not seem to like that much, well ;)
>
> Please find attached a rebased patch set labelled v4.

Thanks!

> Built-in fixed-numbered stats are still attached to the snapshot and shmem
> control structures, and custom fixed stats kinds are tracked in the
> same way as v3 with new members tracking data stored in
> TopMemoryContext for the snapshots and shmem for the control data.
> So, the custom and built-in stats kinds are separated into separate
> parts of the structures, including the "valid" flags for the
> snapshots. And this avoids any redirection when looking at the
> built-in fixed-numbered stats.

Yeap.

> I've tried to address all the previous comments (there could be stuff
> I've missed while rebasing, of course).

Thanks!

> The first three patches are refactoring pieces to make the rest more
> edible, while 0004~ implement the main logic with templates in
> modules/injection_points:
> - 0001 refactors pgstat_write_statsfile() so that a loop on
> PgStat_KindInfo is used to write the data. This is done with the
> addition of snapshot_ctl_off in PgStat_KindInfo, to point to the area
> in PgStat_Snapshot where the data is located for fixed stats.
> 9004abf6206e has done the same for the read part.

Looking at 0001:

1 ===

+	for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
+	{
+		char	   *ptr;
+		const PgStat_KindInfo *info = pgstat_get_kind_info(kind);

I wonder if we could avoid going through stats that are not fixed ones.
What about doing something like?

"
for (int kind = ; kind <= ; kind++);
"

Would probably need to change the indexing logic though, and then we could
replace:

+		if (!info->fixed_amount)
+			continue;

with an assert instead.

Same would apply for the read part added in 9004abf6206e.

2 ===

+			pgstat_build_snapshot_fixed(kind);
+			ptr = ((char *) ) + info->snapshot_ctl_off;
+			write_chunk(fpout, ptr, info->shared_data_len);

I think that using "shared_data_len" is confusing here (was not the case in
the context of 9004abf6206e). I mean it is perfectly correct but the wording
"shared" looks weird to me when being used here. What it really is, is the
size of the stats. What about renaming shared_data_len with stats_data_len?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: walsender.c comment with no context is hard to understand
Hi,

On Mon, Jul 08, 2024 at 11:20:45AM +0530, Amit Kapila wrote:
> On Mon, Jul 8, 2024 at 11:08 AM Bertrand Drouvot wrote:
> >
> > On Mon, Jul 08, 2024 at 08:46:19AM +0530, Amit Kapila wrote:
> > > This sounds better but it is better to add just before we determine
> > > am_cascading_walsender as is done in the attached. What do you think?
> >
> > Thanks! LGTM.
>
> I would like to push this to HEAD only as we don't see any bug that
> this change can prevent. What do you think?

Yeah, fully agree. I don't see how the previous check location could produce
any bug.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: walsender.c comment with no context is hard to understand
Hi,

On Mon, Jul 08, 2024 at 08:46:19AM +0530, Amit Kapila wrote:
> This sounds better but it is better to add just before we determine
> am_cascading_walsender as is done in the attached. What do you think?

Thanks! LGTM.

> BTW, is it possible that the promotion gets completed after we wait
> for the required WAL and before assigning am_cascading_walsender?

Yeah, I don't think there is anything that would prevent a promotion to
happen and complete here. I did a few tests (pausing the walsender with gdb
at various places and promoting the standby).

> I think even if that happens we can correctly determine the required
> timeline because all the required WAL is already available, is that
> correct?

Yeah, that's correct.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: walsender.c comment with no context is hard to understand
Hi,

On Fri, Jul 05, 2024 at 11:10:00AM +0530, Amit Kapila wrote:
> On Fri, Jun 28, 2024 at 6:30 PM Bertrand Drouvot wrote:
> >
> > On Fri, Jun 28, 2024 at 03:15:22PM +0530, Amit Kapila wrote:
> > > On Fri, Jun 28, 2024 at 12:55 PM Peter Smith wrote:
> > > >
> > >
> > > I don't know whether your assumption is correct. AFAICS, those two
> > > lines should be together. Let us see if Bertrand remembers anything?
> >
> > IIRC the WalSndWaitForWal() call has been moved to ensure that we can
> > determine the timeline accurately.
>
> This part is understandable but I don't understand the part of the
> comment (This is needed to determine am_cascading_walsender accurately ..)
> atop a call to WalSndWaitForWal(). The am_cascading_walsender is
> determined based on the results of RecoveryInProgress(). Can the wait
> for WAL by using WalSndWaitForWal() change the result of
> RecoveryInProgress()?

No, but WalSndWaitForWal() must be called _before_ assigning
"am_cascading_walsender = RecoveryInProgress();". The reason is that during
a promotion am_cascading_walsender must be assigned _after_ the walsender is
woken up (after the promotion). So that when the walsender exits
WalSndWaitForWal(), then am_cascading_walsender is assigned "accurately" and
so is the timeline.

What I meant to say in this comment is that "am_cascading_walsender =
RecoveryInProgress();" must be called _after_ "flushptr =
WalSndWaitForWal(targetPagePtr + reqLen);".

For example, swapping both lines would cause 035_standby_logical_decoding.pl
to fail during the promotion test as the walsender would read from the
"previous" timeline and then produce things like:

"ERROR: could not find record while sending logically-decoded data: invalid
record length at 0/6427B20: expected at least 24, got 0"

To avoid ambiguity should we replace?

"
/*
 * Make sure we have enough WAL available before retrieving the current
 * timeline. This is needed to determine am_cascading_walsender accurately
 * which is needed to determine the current timeline.
 */
"

With:

"
/*
 * Make sure we have enough WAL available before retrieving the current
 * timeline. am_cascading_walsender must be assigned after
 * WalSndWaitForWal() (so that it is also correct when the walsender wakes
 * up after a promotion).
 */
"

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Pluggable cumulative statistics
ed stats (by moving the "stats" part at the header of their dedicated
struct). That would mean having things like:

"
typedef struct PgStatShared_Archiver
{
	PgStat_ArchiverStats stats;

	/* lock protects ->reset_offset as well as stats->stat_reset_timestamp */
	LWLock		lock;
	uint32		changecount;
	PgStat_ArchiverStats reset_offset;
} PgStatShared_Archiver;
"

Not sure that's worth it though.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Restart pg_usleep when interrupted
Hi,

On Fri, Jun 28, 2024 at 05:50:20PM -0500, Sami Imseih wrote:
> Thanks for the feedback!
>
> > On Jun 28, 2024, at 4:34 PM, Tom Lane wrote:
> >
> > Sami Imseih writes:
> >> Reattaching the patch.
> >
> > I feel like this is fundamentally a wrong solution, for the reasons
> > cited in the comment for pg_usleep: long sleeps are a bad idea
> > because of the resulting uncertainty about whether we'll respond to
> > interrupts and such promptly. An example here is that if we get
> > a query cancel interrupt, we should probably not insist on finishing
> > out the current sleep before responding.
>
> The case which brought up this discussion is the pg_usleep that
> is called within the vacuum_delay_point being interrupted.
>
> When I read the same code comment you cited, it sounded to me
> that "long sleeps" are those that are in seconds or minutes. The longest
> vacuum delay allowed is 100ms.

I think that with the proposed patch the actual wait time can be "long".
Indeed, the time between the interruptions and restarts of the nanosleep()
call will lead to drift (as mentioned in the nanosleep() man page). So, with
a large number of interruptions, the actual wait time could be way longer
than the expected wait time.

To put numbers on it, I did a few tests with the patch (and with v2 shared
in [1]): cost delay is 1ms and cost limit is 10. With 50 indexes and 10
parallel workers I can see things like:

2024-07-02 08:22:23.789 UTC [2189616] LOG: expected 1.00, actual 239.378368
2024-07-02 08:22:24.575 UTC [2189616] LOG: expected 0.10, actual 224.331737
2024-07-02 08:22:25.363 UTC [2189616] LOG: expected 1.30, actual 230.462793
2024-07-02 08:22:26.154 UTC [2189616] LOG: expected 1.00, actual 225.980803

Means we waited more than the max allowed cost delay (100ms).

With 49 parallel workers, it's worse, as I can see things like:

2024-07-02 08:26:36.069 UTC [2189807] LOG: expected 1.00, actual 1106.790136
2024-07-02 08:26:36.298 UTC [2189807] LOG: expected 1.00, actual 218.148985

The first actual wait time is about 1 second (it has been interrupted about
16300 times during this second).

To avoid this drift, the nanosleep() man page suggests using
clock_nanosleep() with an absolute time value; that might be another idea to
explore.

[1]: https://www.postgresql.org/message-id/flat/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi,

On Mon, Jul 01, 2024 at 09:39:17AM +, Bertrand Drouvot wrote:
> Hi,
>
> On Wed, Jun 26, 2024 at 10:24:41AM +, Bertrand Drouvot wrote:
> > Hi,
> >
> > On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote:
> > > Another thought for the RelationRelationId class case: we could check
> > > if there is a lock first and if there is no lock then acquire one.
> > > That way that would ensure the relation is always locked (so no "risk"
> > > anymore), but OTOH it may add "unnecessary" locking (see 2. mentioned
> > > previously).
> >
> > Please find attached v12 implementing this idea for the
> > RelationRelationId class case. As mentioned, it may add unnecessary
> > locking for 2. but I think that's worth it to ensure that we are always
> > on the safe side of things. This idea is implemented in
> > LockNotPinnedObjectById().
>
> Please find attached v13, mandatory rebase due to 0cecc908e97. In passing,
> make use of CheckRelationOidLockedByMe() added in 0cecc908e97.

Please find attached v14, mandatory rebase due to 65b71dec2d5.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 31767e9eb290909943d03408d9f8e827cf17e0bd Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v14] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for
example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a
non-existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a
non-existing schema.
To avoid those scenarios, a new lock (that conflicts with a lock taken by
DROP) has been put in place before the dependencies are being recorded. With
this in place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still
exists: with this in place session 2 in scenario 1 would be locked and would
report an error once session 1 commits (that would not be the case should
session 1 abort the transaction).

If the object is dropped before the new lock attempt is triggered then the
patch would also report an error (but with less details).

The patch takes into account any type of objects except the ones that are
pinned (they are not droppable because the system requires it).

A special case is done for objects that belong to the RelationRelationId
class. For those, we should be in one of the two following cases that would
already prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a
relation that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock
on its table, a table protected by a lock on a function that depends on the
table...)

To avoid any risks for the RelationRelationId class case, we acquire a lock
if there is none. That may add unnecessary locking for 2. but that's worth
it.
The patch adds a few tests for some dependency cases (that would currently
produce orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c         |   1 +
 src/backend/catalog/dependency.c     | 121 +++-
 src/backend/catalog/heap.c           |   7 +
 src/backend/catalog/index.c          |  26 
 src/backend/catalog/objectaddress.c  |  57 
 src/backend/catalog/pg_aggregate.c   |   9 ++
 src/backend/catalog/pg_attrdef.c     |   1 +
 src/backend/catalog/pg_cast.c        |   5 +
 src/backend/catalog/pg_collation.c   |   1 +
 src/backend/catalog/pg_constraint.c  |  26 
 src/backend/catalog/pg_conversion.c  |   2 +
 src/backend/catalog/pg_depend.c      |  40 +-
 src/backend/catalog/pg_operator.c    |  19 +++
 src/backend/catalog/pg_proc.c        |  17 ++-
 src/backend/catalog/pg_publication.c |   7 +
 src/backend/catalog/pg_range.c       |   6 +
 src/backend/catalog/pg_type.c        |  39 ++
 src/backend/catalog/toasting.c       |   1 +
 src/backend/commands/alter.c         |   4 +
 src/backend/commands/amcmds.c        |   1 +
 src/backend/commands/cluster.c       |   7 +
 src/backend/comman
Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING
Hi,

On Mon, Jul 01, 2024 at 10:21:35AM -0400, Tom Lane wrote:
> Michael Paquier writes:
> > On Mon, Jul 01, 2024 at 06:42:46AM +, Bertrand Drouvot wrote:
> >> I think it would make sense to declare / define those functions only for
> >> assert enabled build: please find attached a tiny patch doing so.
>
> > Not convinced that's a good idea. What about out-of-core code that
> > may use these routines for runtime checks in non-assert builds?
>
> Yeah. Also, I believe it's possible for an extension that's been
> built with assertions enabled to be used with a core server that
> wasn't. This is why, for example, ExceptionalCondition() is not
> ifdef'd away in a non-assert build. Even if you think there's
> no use for CheckRelation[Oid]LockedByMe except in assertions,
> it'd still be plenty reasonable for an extension to call them
> in assertions.

Yeah good point, thanks for the feedback! I've withdrawn the CF entry.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING
Hi,

On Mon, Jul 01, 2024 at 05:01:46PM +0900, Michael Paquier wrote:
> On Mon, Jul 01, 2024 at 06:42:46AM +0000, Bertrand Drouvot wrote:
> > While working on a rebase for [1] due to 0cecc908e97, I noticed that
> > CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only
> > in assertions.
> >
> > I think it would make sense to declare / define those functions only for
> > assert enabled build: please find attached a tiny patch doing so.
> >
> > Thoughts?
>
> Not convinced that's a good idea. What about out-of-core code that
> may use these routines for runtime checks in non-assert builds?

Thanks for the feedback.

Yeah that could be an issue for CheckRelationLockedByMe()
(CheckRelationOidLockedByMe() is too recent to be a concern). Having said
that, 1. out-of-core code could want to use CheckRelationOidLockedByMe()
(probably if it was already using CheckRelationLockedByMe()) and 2. I just
submitted a rebase for [1] in which I thought that using
CheckRelationOidLockedByMe() would be a good idea.

So I think that we can get rid of this proposal.

[1]: https://www.postgresql.org/message-id/ZoJ5RVtMziIa3TQp%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi,

On Wed, Jun 26, 2024 at 10:24:41AM +, Bertrand Drouvot wrote:
> Hi,
>
> On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote:
> > Another thought for the RelationRelationId class case: we could check if
> > there is a lock first and if there is no lock then acquire one. That way
> > that would ensure the relation is always locked (so no "risk" anymore),
> > but OTOH it may add "unnecessary" locking (see 2. mentioned previously).
>
> Please find attached v12 implementing this idea for the RelationRelationId
> class case. As mentioned, it may add unnecessary locking for 2. but I
> think that's worth it to ensure that we are always on the safe side of
> things. This idea is implemented in LockNotPinnedObjectById().

Please find attached v13, mandatory rebase due to 0cecc908e97. In passing,
make use of CheckRelationOidLockedByMe() added in 0cecc908e97.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 461ae5d2fa037b2b9fd285c2a328c8bc785eaec8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v13] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for
example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a
non-existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a
non-existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by
DROP) has been put in place before the dependencies are being recorded. With
this in place, the drop schema in scenario 2 would be locked.
Also, after the new lock attempt, the patch checks that the object still
exists: with this in place session 2 in scenario 1 would be locked and would
report an error once session 1 commits (that would not be the case should
session 1 abort the transaction).

If the object is dropped before the new lock attempt is triggered then the
patch would also report an error (but with less details).

The patch takes into account any type of objects except the ones that are
pinned (they are not droppable because the system requires it).

A special case is done for objects that belong to the RelationRelationId
class. For those, we should be in one of the two following cases that would
already prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a
relation that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock
on its table, a table protected by a lock on a function that depends on the
table...)

To avoid any risks for the RelationRelationId class case, we acquire a lock
if there is none. That may add unnecessary locking for 2. but that's worth
it.
The patch adds a few tests for some dependency cases (that would currently
produce orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c         |   1 +
 src/backend/catalog/dependency.c     | 121 +++-
 src/backend/catalog/heap.c           |   7 +
 src/backend/catalog/index.c          |  26 
 src/backend/catalog/objectaddress.c  |  57 
 src/backend/catalog/pg_aggregate.c   |   9 ++
 src/backend/catalog/pg_attrdef.c     |   1 +
 src/backend/catalog/pg_cast.c        |   5 +
 src/backend/catalog/pg_collation.c   |   1 +
 src/backend/catalog/pg_constraint.c  |  26 
 src/backend/catalog/pg_conversion.c  |   2 +
 src/backend/catalog/pg_depend.c      |  40 +-
 src/backend/catalog/pg_operator.c    |  19 +++
 src/backend/catalog/pg_proc.c        |  17 ++-
 src/backend/catalog/pg_publication.c |   7 +
 src/backend/catalog/pg_range.c       |   6 +
 src/backend/catalog/pg_type.c        |  39 ++
 src/backend/catalog/toasting.c       |   1 +
 src/backend/commands/alter.c         |   4 +
 src/backend/commands/amcmds.c        |   1 +
 src/backend/commands/cluster.c       |   7 +
 src/backend/commands/event_trigger.c |   1 +
 src/backend/commands/extension.c     |   5 +
 src/backend/commands/foreigncmds.c   |   7 +
 src/backend/commands/functioncmds.c  |   6 +
 src/backend/commands/indexcmds.c     |   2 +
 src/ba
Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING
Hi,

On Mon, Jul 01, 2024 at 12:35:34PM +0530, Bharath Rupireddy wrote:
> Hi,
>
> On Mon, Jul 1, 2024 at 12:12 PM Bertrand Drouvot wrote:
> >
> > Hi hackers,
> >
> > While working on a rebase for [1] due to 0cecc908e97, I noticed that
> > CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only
> > in assertions.
> >
> > I think it would make sense to declare / define those functions only for
> > assert enabled build: please find attached a tiny patch doing so.
> >
> > Thoughts?
>
> If turning the CheckRelationXXXLocked() compile for non-assert builds,
> why not do the same for LWLockHeldByMe, LWLockAnyHeldByMe and
> LWLockHeldByMeInMode that are debug-only and being used in asserts?
> While it might reduce the compiled binary size a bit for release
> builds, we may have to be cautious about external or out of core
> modules using them.

Thanks for the feedback.

CheckRelationOidLockedByMe() is new (it has been added in 0cecc908e97).
While its counterpart CheckRelationLockedByMe() has been added since a few
years (2018) in commit b04aeb0a053, I thought it would make sense to
surround both of them.

While it's true that we could also surround LWLockHeldByMe() (added in
e6cba71503f, 2004, and signature change in ea9df812d85, 2014),
LWLockAnyHeldByMe() (added in eed959a457e, 2022) and LWLockHeldByMeInMode()
(added in 016abf1fb83, 2016), I'm not sure we should (due to their "age" and
as you said we have to be cautious about out of core modules / extensions
that may use them).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING
Hi hackers,

While working on a rebase for [1] due to 0cecc908e97, I noticed that
CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only in
assertions.

I think it would make sense to declare / define those functions only for
assert enabled builds: please find attached a tiny patch doing so.

Thoughts?

[1]: https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From bee765581b6b356a6c9c2e4407243f1622b8e84a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Mon, 1 Jul 2024 05:38:30 +
Subject: [PATCH v1] Surround CheckRelation[Oid]LockedByMe() with
 USE_ASSERT_CHECKING

As CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only
in assertions, declare / define those functions only for assert enabled
builds.
---
 src/backend/storage/lmgr/lmgr.c | 2 ++
 src/include/storage/lmgr.h      | 2 ++
 2 files changed, 4 insertions(+)
  50.0% src/backend/storage/lmgr/
  50.0% src/include/storage/

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 094522acb41..234cab4f7af 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -318,6 +318,7 @@ UnlockRelation(Relation relation, LOCKMODE lockmode)
 	LockRelease(, lockmode, false);
 }
+#ifdef USE_ASSERT_CHECKING
 /*
  * CheckRelationLockedByMe
  *
@@ -352,6 +353,7 @@ CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode, bool orstronger)
 	return LockHeldByMe(, lockmode, orstronger);
 }
+#endif
 /*
  * LockHasWaitersRelation
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index ce15125ac3b..3d5f989ae80 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -46,10 +46,12 @@ extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void LockRelation(Relation relation, LOCKMODE lockmode);
 extern bool ConditionalLockRelation(Relation relation,
 									LOCKMODE lockmode);
 extern void UnlockRelation(Relation relation, LOCKMODE lockmode);
+#ifdef USE_ASSERT_CHECKING
 extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
 									bool orstronger);
 extern bool CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode,
 									   bool orstronger);
+#endif
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
--
2.34.1
Re: Track the amount of time waiting due to cost_delay
Hi, On Fri, Jun 28, 2024 at 08:07:39PM +, Imseih (AWS), Sami wrote: > > 46ebdfe164 will interrupt the leaders sleep every time a parallel workers > > reports > > progress, and we currently don't handle interrupts by restarting the sleep > > with > > the remaining time. nanosleep does provide the ability to restart with the > > remaining > > time [1], but I don't think it's worth the effort to ensure more accurate > > vacuum delays for the leader process. > > After discussing offline with Bertrand, it may be better to have > a solution to deal with the interrupts and allows the sleep to continue to > completion. This will simplify this patch and will be useful > for other cases in which parallel workers need to send a message > to the leader. This is the thread [1] for that discussion. > > [1] > https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-00%40email.amazonses.com > Yeah, I think it would make sense to put this thread on hold until we know more about the outcome of [1] (mentioned above). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: walsender.c comment with no context is hard to understand
Hi, On Fri, Jun 28, 2024 at 03:15:22PM +0530, Amit Kapila wrote: > On Fri, Jun 28, 2024 at 12:55 PM Peter Smith wrote: > > > > On Fri, Jun 28, 2024 at 4:18 PM Amit Kapila wrote: > > > > > > On Fri, Jun 28, 2024 at 5:15 AM Peter Smith wrote: > > > > > > > > On Thu, Jun 27, 2024 at 3:44 PM Michael Paquier > > > > wrote: > > > > > > > > > > On Wed, Jun 26, 2024 at 02:30:26PM +0530, vignesh C wrote: > > > > > > On Mon, 3 Jun 2024 at 11:21, Peter Smith > > > > > > wrote: > > > > > >> Perhaps the comment should say something like it used to: > > > > > >> /* Fail if there is not enough WAL available. This can happen > > > > > >> during > > > > > >> shutdown. */ > > > > > > > > > > > > Agree with this, +1 for this change. > > > > > > > > > > That would be an improvement. Would you like to send a patch with all > > > > > the areas you think could stand for improvements? > > > > > -- > > > > > > > > OK, I attached a patch equivalent of the suggestion in this thread. > > > > > > > > > > Shouldn't the check for flushptr (if (flushptr < targetPagePtr + > > > reqLen)) be moved immediately after the call to WalSndWaitForWal(). > > > The comment seems to suggests the same: "Make sure we have enough WAL > > > available before retrieving the current timeline. .." > > > > > > > Yes, as I wrote in the first post, those lines did once used to be > > adjacent in logical_read_xlog_page. > > > > I also wondered if they still belonged together, but I opted for the > > safest option of fixing only the comment instead of refactoring old > > code when no problem had been reported. > > > > AFAICT these lines became separated due to a 2023 patch [1], and you > > were one of the reviewers of that patch, so I assumed the code > > separation was deemed OK at that time. Unless it was some mistake that > > slipped past multiple reviewers? > > > > I don't know whether your assumption is correct. AFAICS, those two > lines should be together. Let us ee if Bertrand remembers anything? 
> IIRC the WalSndWaitForWal() call has been moved to ensure that we can determine the timeline accurately. I agree with Amit that it would make more sense to move the (flushptr < targetPagePtr + reqLen) check just after the flushptr assignment. I don't recall that we discussed any reason for not doing so. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: New standby_slot_names GUC in PG 17
Hi, On Wed, Jun 26, 2024 at 09:15:48AM +, Zhijie Hou (Fujitsu) wrote: > Renamed these to the names suggested by Amit. > > Attach the v2 patch set which addressed above and removed > the changes in release-17.sgml according to the comment from Amit. > Thanks! LGTM. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote: > Another thought for the RelationRelationId class case: we could check if there > is a lock first and if there is no lock then acquire one. That way that would > ensure the relation is always locked (so no "risk" anymore), but OTOH it may > add "unecessary" locking (see 2. mentioned previously). Please find attached v12 implementing this idea for the RelationRelationId class case. As mentioned, it may add unnecessary locking for 2. but I think that's worth it to ensure that we are always on the safe side of things. This idea is implemented in LockNotPinnedObjectById(). A few remarks: - there is one place where the relation is not visible (even if CommandCounterIncrement() is used). That's in TypeCreate(), because the new relation Oid is _not_ added to pg_class yet. Indeed, in heap_create_with_catalog(), AddNewRelationType() is called before AddNewRelationTuple(). I put a comment in this part of the code explaining why it's not necessary to call LockRelationOid() here. - some namespace-related stuff is removed from "test_oat_hooks/expected/alter_table.out". That's due to the logic in cachedNamespacePath() and the fact that the same namespace-related stuff is added prior in alter_table.out. - the patch touches 37 .c files, but that's mainly due to the fact that LockNotPinnedObjectById() has to be called in a lot of places.
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From d6c91b19a585d0a6f5c0abacb9c59cc5acd795f7 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v12] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place before the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. Also, after the new lock attempt, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 committs (that would not be the case should session 1 abort the transaction). If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with less details). The patch takes into account any type of objects except the ones that are pinned (they are not droppable because the system requires it). A special case is done for objects that belong to the RelationRelationId class. For those, we should be in one of the two following cases that would already prevent the relation to be dropped: 1. The relation is already locked (could be an existing relation or a relation that we are creating). 2. 
The relation is protected indirectly (i.e an index protected by a lock on its table, a table protected by a lock on a function that depends the table...) To avoid any risks for the RelationRelationId class case, we acquire a lock if there is none. That may add unnecessary lock for 2. but that's worth it. The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - alter a dependency (function and schema) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- contrib/test_decoding/expected/twophase.out | 3 +- src/backend/catalog/aclchk.c | 1 + src/backend/catalog/dependency.c | 125 - src/backend/catalog/heap.c| 7 + src/backend/catalog/index.c | 26 src/backend/catalog/objectaddress.c | 57 src/backend/catalog/pg_aggregate.c| 9 ++ src/backend/catalog/pg_attrdef.c | 1 + src/backend/catalog/pg_cast.c | 5 + src/backend/catalog/pg_collation.c| 1 + src/backend/catalog/pg_constraint.c | 26 src/backend/catalog/pg_conversion.c | 2 + src/backend/catalog/pg_depend.c | 36 - src/backend/catalog/pg_operator.c | 19 +++ src/backend/catalog/pg_proc.c | 17 ++- src/backend/catalog/pg_publication.c | 7 + src/backend/catalog/pg_range.c| 6 + src/backend/catalog/pg_typ
Re: New standby_slot_names GUC in PG 17
Hi, On Wed, Jun 26, 2024 at 11:39:45AM +0530, Amit Kapila wrote: > On Wed, Jun 26, 2024 at 10:19 AM Bertrand Drouvot > wrote: > > > > > > 2 > > > > Should we rename StandbySlotNamesConfigData too? > > > > How about SyncStandbySlotsConfigData? > > > 3 > > > > Should we rename SlotExistsInStandbySlotNames too? > > > > Similarly SlotExistsInSyncStandbySlots? > > > 4 > > > > Should we rename validate_standby_slots() too? > > > > And validate_sync_standby_slots()? > Thanks! All of the above proposal sound good to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: New standby_slot_names GUC in PG 17
Hi, On Wed, Jun 26, 2024 at 04:17:45AM +, Zhijie Hou (Fujitsu) wrote: > On Wednesday, June 26, 2024 9:40 AM Masahiko Sawada > wrote: > > > > On Tue, Jun 25, 2024 at 5:32 PM Amit Kapila > > wrote: > > > > > > I feel synchronized better indicates the purpose because we ensure > > > such slots are synchronized before we process changes for logical > > > failover slots. We already have a 'failover' option for logical slots > > > which could make things confusing if we add 'failover' where physical > > > slots need to be specified. > > > > Agreed. So +1 for synchronized_stnadby_slots. > > +1. > > Since there is a consensus on this name, I am attaching the patch to rename > the GUC to synchronized_stnadby_slots. I have confirmed that the regression > tests and pgindent passed for the patch. > Thanks for the patch! A few comments: 1 In the commit message: " The standby_slot_names GUC is intended to allow specification of physical standby slots that must be synchronized before they are visible to subscribers " Not sure that wording is correct; if we feel the need to explain the GUC, maybe repeat some wording from bf279ddd1c? 2 Should we rename StandbySlotNamesConfigData too? 3 Should we rename SlotExistsInStandbySlotNames too? 4 Should we rename validate_standby_slots() too? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Tue, Jun 25, 2024 at 01:12:16AM +, Imseih (AWS), Sami wrote: Thanks for the feedback! > >> 2. the leader being interrupted while waiting is also already happening on > >> master > >> due to the pgstat_progress_parallel_incr_param() calls in > >> parallel_vacuum_process_one_index() (that have been added in > >> 46ebdfe164). It has been the case "only" 36 times during my test case. > > 46ebdfe164 will interrupt the leaders sleep every time a parallel workers > reports > progress, and we currently don't handle interrupts by restarting the sleep > with > the remaining time. nanosleep does provide the ability to restart with the > remaining > time [1], but I don't think it's worth the effort to ensure more accurate > vacuum delays for the leader process. +1. I don't think it's necessary to have a 100% accurate delay for all the times the delay is involved. I think that's a heuristic parameter (along with the cost limit). What matters in the end is by how much you've been able to pause the whole vacuum (and not on a sleep-by-sleep basis). > > 1. Having a time based only approach to throttle > > I do agree with a time based approach overall. > > > > 1.1) the more parallel workers is used, the less the impact of the leader on > > the vacuum index phase duration/workload is (because the repartition is done > > on more processes). > > Did you mean " because the vacuum is done on more processes"? Yes. > When a leader is operating on a large index(s) during the entirety > of the vacuum operation, wouldn't more parallel workers end up > interrupting the leader more often? That's right, but my point was about the impact on the "whole" duration time and "whole" workload (leader + workers included) and not about the number of times the leader is interrupted. If there are, say, 100 workers, then interrupting the leader (1 process out of 101) is probably less of an issue as it means that there is a lot of work to be done to keep those 100 workers busy.
I don't think the size of the index the leader is vacuuming has an impact. I think that having the leader vacuuming a 100 GB index or 100 x 1GB indexes is the same (as long as all the other workers are active during all that time). > > 3. A 1 second reporting "throttling" looks a reasonable threshold as: > > 3.1 the idea is to have a significant impact when the leader could have been > > interrupted say hundred/thousand times per second. > > 3.2 it does not make that much sense for any tools to sample > > pg_stat_progress_vacuum > > multiple times per second (so a one second reporting granularity seems ok). > > I feel 1 second may still be too frequent. Maybe we'll need more measurements, but this is what my test case is made of: vacuum_cost_delay = 1 vacuum_cost_limit = 10 8 parallel workers, 1 leader 21 indexes (about 1GB each, one 40MB), all in memory. This led to: With a 1 second reporting frequency, the leader has been interrupted about 2500 times over 8m39s, leading to about the same time as on master (8m43s). > What about 10 seconds ( or 30 seconds )? I'm not sure (may need more measurements) but it would probably complicate the reporting a bit (as with the current v3 we'd miss reporting the indexes that take less time than the threshold to complete). > I think this metric in particular will be mainly useful for vacuum runs that > are > running for minutes or more, making reporting every 10 or 30 seconds > still useful. Agree. OTOH, one could be interested in diagnosing what happened during, say, a 5-second peak in I/O resource consumption/latency. Sampling pg_stat_progress_vacuum at a 1-second interval and seeing by how much the vacuum has been paused during that time could help too (especially if it is made of a lot of parallel workers that could lead to a lot of I/O). But it would miss data if we are reporting at a larger interval.
> It just occurred to me also that pgstat_progress_parallel_incr_param > should have a code comment that it will interrupt a leader process and > cause activity such as a sleep to end early. Good point, I'll add a comment for it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: New standby_slot_names GUC in PG 17
Hi, On Tue, Jun 25, 2024 at 10:24:41AM +0530, Amit Kapila wrote: > On Tue, Jun 25, 2024 at 8:20 AM Masahiko Sawada wrote: > > > > On Tue, Jun 25, 2024 at 11:21 AM Zhijie Hou (Fujitsu) > > wrote: > > > > > > I agree the current name seems too generic and the suggested ' > > > synchronized_standby_slots ' > > > is better than the current one. > > > > > > Some other ideas could be: > > > > > > synchronize_slots_on_standbys: it indicates that the standbys that enabled > > > slot sync should be listed in this GUC. > > > > > > logical_replication_wait_slots: it means the logical replication(logical > > > Walsender process) will wait for these slots to advance the confirm flush > > > lsn before proceeding. > > > > I feel that the name that has some connection to "logical replication" > > also sounds good. Let me add some ideas: > > > > - logical_replication_synchronous_standby_slots (might be too long) > > - logical_replication_synchronous_slots > > > > I see your point about keeping logical_replication in the name but > that could also lead one to think that this list can contain logical > slots. Agree, and we may add the same functionality for physical replication slots in the future too (it has been discussed in the thread [1]). So I don't think "logical" should be part of the name. > OTOH, there is some value in keeping '_standby_' in the name as > that is more closely associated with physical standby's and this list > contains physical slots corresponding to physical standby's. So, my > preference is in order as follows: synchronized_standby_slots, > wait_for_standby_slots, logical_replication_wait_slots, > logical_replication_synchronous_slots, and > logical_replication_synchronous_standby_slots. > I like the idea of having "synchronize[d]" in the name as it makes think of the feature it is linked to [2]. The slots mentioned in this parameter are linked to the "primary_slot_name" parameter on the standby, so what about? 
synchronized_primary_slot_names It makes it clear that it is somehow linked to "primary_slot_name" and that we want them to be in sync. So I'd vote for (in that order): synchronized_primary_slot_names, synchronized_standby_slots [1]: https://www.postgresql.org/message-id/bb437218-73bc-34c3-b8fb-8c1be4ddaec9%40enterprisedb.com [2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=93db6cbda037f1be9544932bd9a785dabf3ff712 Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Sat, Jun 22, 2024 at 12:48:33PM +, Bertrand Drouvot wrote: > 1. vacuuming indexes time has been longer on master because with v2, the > leader > has been interrupted 342605 times while waiting, then making v2 "faster". > > 2. the leader being interrupted while waiting is also already happening on > master > due to the pgstat_progress_parallel_incr_param() calls in > parallel_vacuum_process_one_index() (that have been added in > 46ebdfe164). It has been the case "only" 36 times during my test case. > > I think that 2. is less of a concern but I think that 1. is something that > needs > to be addressed because the leader process is not honouring its cost delay > wait > time in a noticeable way (at least during my test case). > > I did not think of a proposal yet, just sharing my investigation as to why > v2 has been faster than master during the vacuuming indexes phase. I think that a reasonable approach is to make the reporting from the parallel workers to the leader less aggressive (meaning it occurs less frequently). Please find attached v3, that: - ensures that there is at least 1 second between 2 reports, per parallel worker, to the leader. - ensures that the reported delayed time is still correct (keeps track of the delayed time between 2 reports). - does not add any extra pg_clock_gettime_ns() calls (as compared to v2). Remarks: 1. Having a time based only approach to throttle the reporting of the parallel workers sounds reasonable. I don't think that the number of parallel workers has to come into play as: 1.1) the more parallel workers are used, the less the impact of the leader on the vacuum index phase duration/workload (because the work is spread across more processes). 1.2) the fewer parallel workers there are, the less the leader will be interrupted (fewer parallel workers would report their delayed time). 2.
The throttling is not based on the cost limit as that value is distributed proportionally among the parallel workers (so we're back to the previous point). 3. The throttling is not based on the actual cost delay value because the leader could be interrupted at the beginning, the middle or whatever part of the wait and we are more interested in the frequency of the interrupts. 4. A 1 second reporting "throttling" looks like a reasonable threshold as: 4.1 the idea is to have a significant impact when the leader could have been interrupted say hundred/thousand times per second. 4.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum multiple times per second (so a one second reporting granularity seems ok). With this approach in place (v3, attached, applied), during my test case: - the leader has been interrupted about 2500 times (instead of about 345000 times with v2) - the vacuum index phase duration is very close to the master one (it has been 4 seconds faster (over an 8 minutes 40 seconds duration), instead of 3 minutes faster with v2). Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 99f417c0bcd7c29e126fdccdd6214ea37db67379 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 24 Jun 2024 08:43:26 + Subject: [PATCH v3] Report the total amount of time that vacuum has been delayed due to cost delay This commit adds one column: time_delayed to the pg_stat_progress_vacuum system view to show the total amount of time in milliseconds that vacuum has been delayed. This uses the new parallel message type for progress reporting added by f1889729dd. In case of parallel worker, to avoid the leader to be interrupted too frequently (while it might be sleeping for cost delay), the report is done only if the last report has been done more than 1 second ago.
Having a time based only approach to throttle the reporting of the parallel workers sounds reasonable. Indeed when deciding about the throttling: 1. The number of parallel workers should not come into play: 1.1) the more parallel workers is used, the less the impact of the leader on the vacuum index phase duration/workload is (because the repartition is done on more processes). 1.2) the less parallel workers is, the less the leader will be interrupted ( less parallel workers would report their delayed time). 2. The cost limit should not come into play as that value is distributed proportionally among the parallel workers (so we're back to the previous point). 3. The cost delay does not come into play as the leader could be interrupted at the beginning, the midle or whatever part of the wait and we are more interested about the frequency of the interrupts. 3. A 1 second reporting "throttling" looks a reasonable threshold as: 3.1 the idea is to have a significant impact when the leader could have been interrupted say hundred/thousand times per second. 3.2 it does n
Re: Track the amount of time waiting due to cost_delay
Hi, On Thu, Jun 13, 2024 at 11:56:26AM +, Bertrand Drouvot wrote: > == Observations > === > > As compare to v2: > > 1. scanning heap time is about the same > 2. vacuuming indexes time is about 3 minutes longer on master > 3. vacuuming heap time is about the same I had a closer look to understand why the vacuuming indexes time has been about 3 minutes longer on master. During the vacuuming indexes phase, the leader is helping vacuuming the indexes until it reaches WaitForParallelWorkersToFinish() (means when all the remaining indexes are currently handled by the parallel workers, the leader has nothing more to do and so it is waiting for the parallel workers to finish). During the time the leader process is involved in indexes vacuuming it is also subject to wait due to cost delay. But with v2 applied, the leader may be interrupted by the parallel workers while it is waiting (due to the new pgstat_progress_parallel_incr_param() calls that the patch is adding). So, with v2 applied, the leader is waiting less (as interrupted while waiting) when being involved in indexes vacuuming and that's why v2 is "faster" than master. To put some numbers, I did count the number of times the leader's pg_usleep() has been interrupted (by counting the number of times the nanosleep() did return a value < 0 in pg_usleep()). Here they are: v2: the leader has been interrupted about 342605 times master: the leader has been interrupted about 36 times The ones on master are mainly coming from the pgstat_progress_parallel_incr_param() calls in parallel_vacuum_process_one_index(). The additional ones on v2 are coming from the pgstat_progress_parallel_incr_param() calls added in vacuum_delay_point(). Conclusion == 1. vacuuming indexes time has been longer on master because with v2, the leader has been interrupted 342605 times while waiting, then making v2 "faster". 2. 
the leader being interrupted while waiting is also already happening on master due to the pgstat_progress_parallel_incr_param() calls in parallel_vacuum_process_one_index() (that have been added in 46ebdfe164). It has been the case "only" 36 times during my test case. I think that 2. is less of a concern but I think that 1. is something that needs to be addressed because the leader process is not honouring its cost delay wait time in a noticeable way (at least during my test case). I did not think of a proposal yet, just sharing my investigation as to why v2 has been faster than master during the vacuuming indexes phase. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Wed, Jun 19, 2024 at 02:11:50PM +, Bertrand Drouvot wrote: > To sum up, I did not see any cases that did not lead to 1. or 2., so I think > it's safe to not add an extra lock for the RelationRelationId case. If, for > any > reason, there is still cases that are outside 1. or 2. then they may lead to > orphaned dependencies linked to the RelationRelationId class. I think that's > fine to take that "risk" given that a. that would not be worst than currently > and b. I did not see any of those in our fleet currently (while I have seen a > non > negligible amount outside of the RelationRelationId case). Another thought for the RelationRelationId class case: we could check if there is a lock first and if there is no lock then acquire one. That way that would ensure the relation is always locked (so no "risk" anymore), but OTOH it may add "unnecessary" locking (see 2. mentioned previously). I think I do prefer this approach to be on the safe side of things; what do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Pluggable cumulative statistics
Hi, On Thu, Jun 20, 2024 at 09:46:30AM +0900, Michael Paquier wrote: > On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote: > > - How should the persistence of the custom stats be achieved? > > Callbacks to give custom stats kinds a way to write/read their data, > > push everything into a single file, or support both? > > - Should this do like custom RMGRs and assign to each stats kinds ID > > that are set in stone rather than dynamic ones? > These two questions have been itching me in terms of how it would work > for extension developers, after noticing that custom RMGRs are used > more than I thought: > https://wiki.postgresql.org/wiki/CustomWALResourceManagers > > The result is proving to be nicer, shorter by 300 lines in total and > much simpler when it comes to think about the way stats are flushed > because it is possible to achieve the same result as the first patch > set without manipulating any of the code paths doing the read and > write of the pgstats file. I think it makes sense to follow the same "behavior" as the custom wal resource managers. That, indeed, looks much more simpler than v1. > In terms of implementation, pgstat.c's KindInfo data is divided into > two parts, for efficiency: > - The exiting in-core stats with designated initializers, renamed as > built-in stats kinds. > - The custom stats kinds are saved in TopMemoryContext, Agree that a backend lifetime memory area is fine for that purpose. > and can only > be registered with shared_preload_libraries. The patch reserves a set > of 128 harcoded slots for all the custom kinds making the lookups for > the KindInfos quite cheap. + MemoryContextAllocZero(TopMemoryContext, + sizeof(PgStat_KindInfo *) * PGSTAT_KIND_CUSTOM_SIZE); and that's only 8 * PGSTAT_KIND_CUSTOM_SIZE bytes in total. 
I had a quick look at the patches (have in mind to do more): > With that in mind, the patch set is more pleasant to the eye, and the > attached v2 consists of: > - 0001 and 0002 are some cleanups, same as previously to prepare for > the backend-side APIs. 0001 and 0002 look pretty straightforward at a quick look. > - 0003 adds the backend support to plug-in custom stats. 1 === It looks to me that there is a mix of "in core" and "built-in" to name the non-custom stats. Maybe it's worth just using one? As far as I can see (and as you said above) this is mainly inspired by the custom resource managers, and 2 === and 3 === are probably copy/paste consequences. 2 === + if (pgstat_kind_custom_infos[idx] != NULL && + pgstat_kind_custom_infos[idx]->name != NULL) + ereport(ERROR, + (errmsg("failed to register custom cumulative statistics \"%s\" with ID %u", kind_info->name, kind), +errdetail("Custom resource manager \"%s\" already registered with the same ID.", + pgstat_kind_custom_infos[idx]->name))); s/Custom resource manager/Custom cumulative statistics/ 3 === + ereport(LOG, + (errmsg("registered custom resource manager \"%s\" with ID %u", + kind_info->name, kind))); s/custom resource manager/custom cumulative statistics/ > - 0004 includes documentation. Did not look yet. > - 0005 is an example of how to use them, with a TAP test providing > coverage. Did not look yet. As I said, I have in mind to do a more in-depth review. I noted the above while doing a quick read of the patches, so I thought it makes sense to share them now. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Pluggable cumulative statistics
Hi, On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote: > Hi all, > > 2) Make the shmem pgstats pluggable so as it is possible for extensions > to register their own stats kinds. Thanks for the patch! I like the idea of having custom stats (it has also been somehow mentioned in [1]). > 2) is actually something that can be used for more things than > just pg_stat_statements, because people love extensions and > statistics (spoiler: I do). +1 > * Making custom stats data persistent is an interesting problem, and > there are a couple of approaches I've considered: > ** Allow custom kinds to define callbacks to read and write data from > a source they'd want, like their own file through a fd. This has the > disadvantage to remove the benefit of c) above. > ** Store everything in the existing stats file, adding one type of > entry like 'S' and 'N' with a "custom" type, where the *name* of the > custom stats kind is stored instead of its ID computed from shared > memory. What about having 2 files? - One is the existing stats file - One "predefined" for all the custom stats (so what you've done minus the in-core stats). This one would not be configurable and the extensions will not need to know about it. Would that remove the benefit from c) that you mentioned up-thread? [1]: https://www.postgresql.org/message-id/20220818195124.c7ipzf6c5v7vxymc%40awork3.anarazel.de Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi,

On Mon, Jun 17, 2024 at 05:57:05PM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Mon, Jun 17, 2024 at 12:24:46PM -0400, Robert Haas wrote:
> > So to try to sum up here: I'm not sure I agree with this design. But I
> > also feel like the design is not as clear and consistently implemented
> > as it could be. So even if I just ignored the question of whether it's
> > the right design, it feels like we're a ways from having something
> > potentially committable here, because of issues like the ones I
> > mentioned in the last paragraph.
>
> Agree. I'll now move on with the "XXX Do we need a lock for RelationRelationId?"
> comments that I put in v10 (see [1]) and study all the cases around them.

A. I went through all of them, did some testing for all, and reached the
conclusion that we must be in one of the two following cases that would
already prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a
relation that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock
on its table, a table protected by a lock on a function that depends on the
table...).

So we don't need to add a lock if this is a RelationRelationId object class
for the cases above.

As a consequence, I replaced the "XXX" related comments that were in v10 by
another set of comments in v11 (attached) like "No need to call
LockRelationOid() (through LockNotPinnedObject())". The reason is to make
it clear in the code and also to ease the review.

B. I explained in [1] (while sharing v10) that the object locking is now
outside of the dependency code except for (and I explained why):

recordDependencyOnExpr()
recordDependencyOnSingleRelExpr()
makeConfigurationDependencies()

So I also did some testing, on the RelationRelationId case, for those and I
reached the same conclusion as the one shared above.

For A. and B. the testing has been done by adding an "ereport(WARNING.." at
those places when a RelationRelationId is involved.
Then I ran "make check" and went through the failed tests (the output was
not the expected one due to the extra "WARNING"), reproduced them with gdb,
and checked for the lock on the relation producing the "WARNING". All of
those were linked to 1. or 2.

Note that adding an assertion on an existing lock would not work for the
cases described in 2.

So, I'm now confident that we must be in 1. or 2., but it's also possible
that I've missed some cases (though I think the way the testing has been
done is not that weak).

To sum up, I did not see any cases that did not lead to 1. or 2., so I
think it's safe to not add an extra lock for the RelationRelationId case.
If, for any reason, there are still cases that are outside 1. or 2., then
they may lead to orphaned dependencies linked to the RelationRelationId
class. I think that's fine to take that "risk" given that a. that would not
be worse than currently and b. I did not see any of those in our fleet
currently (while I have seen a non-negligible amount outside of the
RelationRelationId case).

> Once done, I think that it will be easier to 1. remove ambiguity,
> 2. document and 3. do the "right" thing regarding the RelationRelationId
> object class.

Please find attached v11, where I added more detailed comments in the
commit message and also in the code (I also removed the useless check on
AuthMemRelationId).
[1]: https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 50e11432960e0ad5d940d2e7d9557fc4770d8262 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Fri, 29 Mar 2024 15:43:26 +0000
Subject: [PATCH v11] Avoid orphaned objects dependencies

It's currently possible to create orphaned object dependencies, for
example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a
non-existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a
non-existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by
DROP) has been put in place before the dependencies are recorded. With this
in place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still
exists: with this in place session 2 in scenario 1 would be locked and
would report an error once session 1 commits (that would not be the case
should session 1 abort the transaction). If the object is dropped before the n
Re: Avoid orphaned objects dependencies, take 3
Hi, On Mon, Jun 17, 2024 at 12:24:46PM -0400, Robert Haas wrote: > On Fri, Jun 14, 2024 at 3:54 AM Bertrand Drouvot > wrote: > > > Ah, right. So, I was assuming that, with either this version of your > > > patch or the earlier version, we'd end up locking the constraint > > > itself. Was I wrong about that? > > > > The child contraint itself is not locked when going through > > ConstraintSetParentConstraint(). > > > > While at it, let's look at a full example and focus on your concern. > > I'm not at the point of having a concern yet, honestly. I'm trying to > understand the design ideas. The commit message just says that we take > a conflicting lock, but it doesn't mention which object types that > principle does or doesn't apply to. I think the idea of skipping it > for cases where it's redundant with the relation lock could be the > right idea, but if that's what we're doing, don't we need to explain > the principle somewhere? And shouldn't we also apply it across all > object types that have the same property? Yeah, I still need to deeply study this area and document it. > Along the same lines: > > + /* > + * Those don't rely on LockDatabaseObject() when being dropped (see > + * AcquireDeletionLock()). Also it looks like they can not produce > + * orphaned dependent objects when being dropped. > + */ > + if (object->classId == RelationRelationId || object->classId == > AuthMemRelationId) > + return; > > "It looks like X cannot happen" is not confidence-inspiring. Yeah, it is not. It is just a "feeling" that I need to work on to remove any ambiguity and/or adjust the code as needed. > At the > very least, a better comment is needed here. But also, that relation > has no exception for AuthMemRelationId, only for RelationRelationId. 
> And also, the exception for RelationRelationId doesn't imply that we
> don't need a conflicting lock here: the special case for
> RelationRelationId in AcquireDeletionLock() is necessary because the
> lock tag format is different for relations than for other database
> objects, not because we don't need a lock at all. If the handling here
> were really symmetric with what AcquireDeletionLock(), the coding
> would be to call either LockRelationOid() or LockDatabaseObject()
> depending on whether classid == RelationRelationId.

Agree.

> Now, that isn't
> actually necessary, because we already have relation-locking calls
> elsewhere, but my point is that the rationale this commit gives for
> WHY it isn't necessary seems to me to be wrong in multiple ways.

Agree. I'm not done with that part yet (I should have made that clearer).

> So to try to sum up here: I'm not sure I agree with this design. But I
> also feel like the design is not as clear and consistently implemented
> as it could be. So even if I just ignored the question of whether it's
> the right design, it feels like we're a ways from having something
> potentially committable here, because of issues like the ones I
> mentioned in the last paragraph.

Agree. I'll now move on with the "XXX Do we need a lock for
RelationRelationId?" comments that I put in v10 (see [1]) and study all the
cases around them. Once done, I think that it will be easier to 1. remove
ambiguity, 2. document and 3. do the "right" thing regarding the
RelationRelationId object class.

[1]: https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Allow logical failover slots to wait on synchronous replication
Hi,

On Tue, Jun 11, 2024 at 10:00:46AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Mon, Jun 10, 2024 at 09:25:10PM -0500, Nathan Bossart wrote:
> > On Mon, Jun 10, 2024 at 03:51:05PM -0700, John H wrote:
> > > The existing 'standby_slot_names' isn't great for users who are running
> > > clusters with quorum-based synchronous replicas. For instance, if
> > > the user has synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a
> > > bit tedious to have to reconfigure the standby_slot_names to set it to
> > > the most updated 3 sync replicas whenever different sync replicas start
> > > lagging. In the event that both GUCs are set, 'standby_slot_names' takes
> > > precedence.
> >
> > Hm. IIUC you'd essentially need to set standby_slot_names to "A,B,C,D,E"
> > to get the desired behavior today. That might ordinarily be okay, but it
> > could cause logical replication to be held back unnecessarily if one of the
> > replicas falls behind for whatever reason. A way to tie standby_slot_names
> > to synchronous replication instead does seem like it would be useful in
> > this case.
>
> FWIW, I have the same understanding and also think your proposal would be
> useful in this case.

A few random comments about v1:

1 ===

+ int mode = SyncRepWaitMode;

It's set to SyncRepWaitMode and then never changed. Worth getting rid of
"mode"?

2 ===

+ static XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE] = {InvalidXLogRecPtr};

I did some testing and saw that the lsn[] values were not always set to
InvalidXLogRecPtr right after the declaration. It looks like, in that case,
we should avoid setting the lsn[] values at compile time. Then, what about:

1. removing the "static", or
2. keeping the static but setting the lsn[] values right after the
declaration?

3 ===

- /*
-  * Return false if not all the standbys have caught up to the specified
-  * WAL location.
-  */
- if (caught_up_slot_num != standby_slot_names_config->nslotnames)
-     return false;
+ if (!XLogRecPtrIsInvalid(lsn[mode]) && lsn[mode] >= wait_for_lsn)
+     return true;

The lsn[] values have just been (or should have been, see 2 above)
initialized to InvalidXLogRecPtr, so XLogRecPtrIsInvalid(lsn[mode]) will
always return true. I think this check is not needed then.

4 ===

> > > I did some very brief pgbench runs to compare the latency. Client instance
> > > was running pgbench and 10 logical clients while the Postgres box hosted
> > > the writer and 5 synchronous replicas.
> > > There's a hit to TPS

Out of curiosity, did you compare with standby_slot_names_from_syncrep set
to off and standby_slot_names not empty?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi,

On Thu, Jun 13, 2024 at 04:52:09PM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Thu, Jun 13, 2024 at 10:49:34AM -0400, Robert Haas wrote:
> > On Fri, Jun 7, 2024 at 4:41 AM Bertrand Drouvot
> > wrote:
> > > Do you still find the code hard to maintain with v9?
> >
> > I don't think it substantially changes my concerns as compared with
> > the earlier version.
>
> Thanks for the feedback, I'll give it more thought.

Please find attached v10 that puts the object locking outside of the
dependency code. It's done that way except for:

recordDependencyOnExpr()
recordDependencyOnSingleRelExpr()
makeConfigurationDependencies()

The reason is that I think it would need part of the logic that is inside
the above functions to be duplicated, and I'm not sure that's worth it. For
example, we would probably need to:

- make additional calls to find_expr_references_walker()
- make an additional scan on the config map

It's also not done outside of recordDependencyOnCurrentExtension() as:

1. I think it is clear enough that way (as it is clear that the lock is
taken on an ExtensionRelationId object class).
2. why include "commands/extension.h" in more places (the locking would
depend on "creating_extension" and "CurrentExtensionObject"), given 1.?

Remarks:

1. depLockAndCheckObject() and friends in v9 have been renamed to
LockNotPinnedObject() and friends (as the vast majority of their calls are
now done outside of the dependency code).

2. regarding the concern around RelationRelationId (discussed in [1]), v10
adds a comment "XXX Do we need a lock for RelationRelationId?" at the
places we may want to lock this object class. I did not think about it yet
(but will do), I only added this comment at multiple places.

I think that v10 is easier to follow (as compared to v9) as we can easily
see which object classes we'll put a lock on.

Thoughts?
[1]: https://www.postgresql.org/message-id/Zmv3TPfJAyQXhIdu%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From dc75e3255803617d21c55d77dc218904bd729d81 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Fri, 29 Mar 2024 15:43:26 +0000
Subject: [PATCH v10] Avoid orphaned objects dependencies

It's currently possible to create orphaned object dependencies, for
example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a
non-existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a
non-existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by
DROP) has been put in place before the dependencies are recorded. With this
in place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still
exists: with this in place session 2 in scenario 1 would be locked and
would report an error once session 1 commits (that would not be the case
should session 1 abort the transaction). If the object is dropped before
the new lock attempt is triggered then the patch would also report an error
(but with fewer details).
The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - alter a dependency (function and schema) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- src/backend/catalog/aclchk.c | 1 + src/backend/catalog/dependency.c | 109 ++- src/backend/catalog/heap.c| 8 ++ src/backend/catalog/index.c | 16 +++ src/backend/catalog/objectaddress.c | 57 src/backend/catalog/pg_aggregate.c| 9 ++ src/backend/catalog/pg_attrdef.c | 1 + src/backend/catalog/pg_cast.c | 5 + src/backend/catalog/pg_collation.c| 1 + src/backend/catalog/pg_constraint.c | 11 ++ src/backend/catalog/pg_conversion.c | 2 + src/backend/catalog/pg_depend.c | 27 +++- src/backend/catalog/pg_operator.c | 19 +++ src/backend/catalog/pg_proc.c | 17 ++- src/backend/catalog/pg_publication.c | 5 + src/backend/catalog/pg_range.c| 6 + src/backend/catalog/pg_type.c | 33 + src/backend/catalog/toasting.c| 1 + src/backend/commands/alter.c | 4 + src/backend/commands/amcmds.c |
Re: Avoid orphaned objects dependencies, take 3
Hi,

On Thu, Jun 13, 2024 at 02:27:45PM -0400, Robert Haas wrote:
> On Thu, Jun 13, 2024 at 12:52 PM Bertrand Drouvot
> wrote:
> > > > table_open(childRelId, ...) would lock any "ALTER TABLE
> > > > DROP CONSTRAINT"
> > > > already. Not sure I understand your concern here.
> > >
> > > I believe this is not true. This would take a lock on the table, not
> > > the constraint itself.
> >
> > I agree that it would not lock the constraint itself. What I meant to say
> > is that, nevertheless, the constraint can not be dropped. Indeed, the "ALTER TABLE"
> > necessary to drop the constraint (ALTER TABLE DROP CONSTRAINT)
> > would
> > be locked by the table_open(childRelId, ...).
>
> Ah, right. So, I was assuming that, with either this version of your
> patch or the earlier version, we'd end up locking the constraint
> itself. Was I wrong about that?

The child constraint itself is not locked when going through
ConstraintSetParentConstraint().

While at it, let's look at a full example and focus on your concern.

Let's do that with this gdb file:

"
$ cat gdb.txt
b dependency.c:1542
command 1
printf "Will return for: classId %d and objectId %d\n", object->classId, object->objectId
c
end
b dependency.c:1547 if object->classId == 2606
command 2
printf "Will lock constraint: classId %d and objectId %d\n", object->classId, object->objectId
c
end
"

knowing that:

"
Line 1542 is the return here in depLockAndCheckObject() (your concern):

if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
    return;

Line 1547 is the lock here in depLockAndCheckObject():

/* assume we should lock the whole object not a sub-object */
LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
"

So, with gdb attached to a session let's:

1.
Create the parent relation:

CREATE TABLE upsert_test (
    a INT,
    b TEXT
) PARTITION BY LIST (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
Will return for: classId 1259 and objectId 16384
---

Oid 16384 is upsert_test, so I think the return (dependency.c:1542) is fine
as we are creating the object (it can't be dropped as it is not visible to
anyone else).

2. Create another relation (will be the child):

CREATE TABLE upsert_test_2 (b TEXT, a int);

gdb produces:

---
Will return for: classId 1259 and objectId 16391
Will return for: classId 1259 and objectId 16394
Will return for: classId 1259 and objectId 16394
Will return for: classId 1259 and objectId 16391
---

Oid 16391 is upsert_test_2
Oid 16394 is pg_toast_16391

so I think the return (dependency.c:1542) is fine as we are creating those
objects (they can't be dropped as they are not visible to anyone else).

3. Attach the partition:

ALTER TABLE upsert_test ATTACH PARTITION upsert_test_2 FOR VALUES IN (2);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
---

That's fine because we'd already locked the relation 16384 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().

4. Add a constraint on the child relation:

ALTER TABLE upsert_test_2 add constraint bdtc2 UNIQUE (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16391
Will lock constraint: classId 2606 and objectId 16397
---

That's fine because we'd already locked the relation 16391 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().
Oid 16397 is the constraint we're creating (bdtc2).

5.
Add a constraint on the parent relation (this goes through
ConstraintSetParentConstraint()):

ALTER TABLE upsert_test add constraint bdtc1 UNIQUE (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
Will lock constraint: classId 2606 and objectId 16399
Will return for: classId 1259 and objectId 16398
Will return for: classId 1259 and objectId 16391
Will lock constraint: classId 2606 and objectId 16399
Will return for: classId 1259 and objectId 16391
---

Regarding "Will return for: classId 1259 and objectId 16384": That's fine
because we'd already locked the relation 16384 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().

Regarding "Will lock constraint: classId 2606 and objectId 16399": Oid
16399 is the constraint that we're creating.

Regarding "Will return for: classId 1259 and objectId 16398": That's fine
because Oid 16398 is an index that we're creating while creating the
constraint (so the index can't be dropped as it is not visible to anyone
else).

Regarding "Will return for: classId 1259 and objectId 16391": That's fine
because 16384 would already be locked (as mentioned above). And I think
that's fine because trying to drop "upsert_test_2" (ak
Re: Avoid orphaned objects dependencies, take 3
Hi,

On Thu, Jun 13, 2024 at 10:49:34AM -0400, Robert Haas wrote:
> On Fri, Jun 7, 2024 at 4:41 AM Bertrand Drouvot
> wrote:
> > Do you still find the code hard to maintain with v9?
>
> I don't think it substantially changes my concerns as compared with
> the earlier version.

Thanks for the feedback, I'll give it more thought.

> > > but we're not similarly careful about other operations e.g.
> > > ConstraintSetParentConstraint is called by DefineIndex which calls
> > > table_open(childRelId, ...) first, but there's no logic in DefineIndex
> > > to lock the constraint.
> >
> > table_open(childRelId, ...) would lock any "ALTER TABLE DROP
> > CONSTRAINT"
> > already. Not sure I understand your concern here.
>
> I believe this is not true. This would take a lock on the table, not
> the constraint itself.

I agree that it would not lock the constraint itself. What I meant to say
is that, nevertheless, the constraint can not be dropped. Indeed, the
"ALTER TABLE" necessary to drop the constraint (ALTER TABLE DROP
CONSTRAINT) would be locked by the table_open(childRelId, ...). That's why
I don't understand your concern with this particular example.

But anyway, I'll double check your related concern:

+ if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+     return;

in depLockAndCheckObject().

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi,

On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
>
> > Thank you for the proposal and the patch. I understand the motivation
> > of this patch.
>
> Thanks for looking at it!
>
> > Beside the point Nathan mentioned, I'm slightly worried
> > that massive parallel messages could be sent to the leader process
> > when the cost_limit value is low.
>
> I see, I can/will do some testing in this area and share the numbers.

Here is the result of the test. It has been launched several times and it
produced the same (surprising) result each time.

== Context

The testing has been done with this relation (large_tbl) and its indexes:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY n.nspname, c.relname
ORDER BY 3 DESC
LIMIT 22;

 nspname |      relname       | buffers
---------+--------------------+---------
 public  | large_tbl          |      80
 public  | large_tbl_filler13 |  125000
 public  | large_tbl_filler6  |  125000
 public  | large_tbl_filler5  |  125000
 public  | large_tbl_filler3  |  125000
 public  | large_tbl_filler15 |  125000
 public  | large_tbl_filler4  |  125000
 public  | large_tbl_filler20 |  125000
 public  | large_tbl_filler18 |  125000
 public  | large_tbl_filler14 |  125000
 public  | large_tbl_filler8  |  125000
 public  | large_tbl_filler11 |  125000
 public  | large_tbl_filler19 |  125000
 public  | large_tbl_filler7  |  125000
 public  | large_tbl_filler1  |  125000
 public  | large_tbl_filler12 |  125000
 public  | large_tbl_filler9  |  125000
 public  | large_tbl_filler17 |  125000
 public  | large_tbl_filler16 |  125000
 public  | large_tbl_filler10 |  125000
 public  | large_tbl_filler2  |  125000
 public  | large_tbl_pkey     |    5486
(22 rows)

All of them completely fit in memory (to avoid I/O read latency during the
vacuum).
The config, outside of the defaults, is:

max_wal_size = 4GB
shared_buffers = 30GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 8
max_parallel_workers = 10
vacuum_cost_limit = 10
autovacuum_vacuum_cost_limit = 10

My system is not overloaded, has enough resources to run this test, and
only this test is running.

== Results

== With v2 (attached) applied on master

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1146873.016 ms (19:06.873)

The duration is split as follows:

Vacuum phase: cumulative time (cumulative time delayed)
===
scanning heap:     00:08:16.414628 (time_delayed: 444370)
vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293)
vacuuming heap:    00:19:06.814617 (time_delayed: 2767540)

I sampled active sessions from pg_stat_activity (one second interval); here
is the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |   wait_event   | count
------------+--------+----------------+-------
     452996 | 453225 | VacuumDelay    |   366
     452996 | 453223 | VacuumDelay    |   363
     452996 | 453226 | VacuumDelay    |   362
     452996 | 453224 | VacuumDelay    |   361
     452996 | 453222 | VacuumDelay    |   359
     452996 | 453221 | VacuumDelay    |   359
            | 452996 | VacuumDelay    |   331
            | 452996 | CPU            |    30
     452996 | 453224 | WALWriteLock   |    23
     452996 | 453222 | WALWriteLock   |    20
     452996 | 453226 | WALWriteLock   |    20
     452996 | 453221 | WALWriteLock   |    19
            | 452996 | WalSync        |    18
     452996 | 453225 | WALWriteLock   |    18
     452996 | 453223 | WALWriteLock   |    16
            | 452996 | WALWriteLock   |    15
     452996 | 453221 | CPU            |    14
     452996 | 453222 | CPU            |    14
     452996 | 453223 | CPU            |    12
     452996 | 453224 | CPU            |    10
     452996 | 453226 | CPU            |    10
     452996 | 453225 | CPU            |     8
     452996 | 453223 | WalSync        |     4
     452996 | 453221 | WalSync        |     2
     452996 | 453226 | WalWrite       |     2
     452996 | 453221 | WalWrite       |     1
            | 452996 | ParallelFinish |     1
     452996 | 453224 | WalSync        |     1
     452996 | 453225 | WalSync        |     1
     452996 | 453222 | WalWrite       |     1
     452996 | 453225 | WalWrite       |     1
     452996 | 453222 | WalSync        |     1
     452996 | 4
Re: relfilenode statistics
Hi, On Tue, Jun 11, 2024 at 03:35:23PM +0900, Kyotaro Horiguchi wrote: > At Mon, 10 Jun 2024 08:09:56 +0000, Bertrand Drouvot > wrote in > > > > My idea was to move all that is in pg_statio_all_tables to relfilenode stats > > and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) > > ensure > > the user can still retrieve the stats from pg_statio_all_tables in such a > > way > > that it survives a rewrite, 3) provide dedicated APIs to retrieve > > relfilenode stats but only for the current relfilenode, 4) document this > > behavior. This is what the POC patch is doing for heap_blks_written (would > > need to do the same for heap_blks_read and friends) except for the > > documentation > > part. What do you think? > > In my opinion, Thanks for looking at it! > it is certainly strange that bufmgr is aware of > relation kinds, but introducing relfilenode stats to avoid this skew > doesn't seem to be the best way, as it invites inconclusive arguments > like the one raised above. The fact that we transfer counters from old > relfilenodes to new ones indicates that we are not really interested > in counts by relfilenode. If that's the case, wouldn't it be simpler > to call pgstat_count_relation_buffer_read() from bufmgr.c and then > branch according to relkind within that function? If you're concerned > about the additional branch, some ingenuity may be needed. That may be doable for "read" activities but what about write activities? Do you mean not relying on relfilenode stats for reads but relying on relfilenode stats for writes? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Tue, Jun 11, 2024 at 01:13:48PM -0400, Robert Haas wrote: > On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot > wrote: > > As we can see the actual wait time is 30ms less than the intended wait time > > with > > this simple test. So I still think we should go with 1) actual wait time > > and 2) > > report the number of waits (as mentioned in [1]). Does that make sense to > > you? > > I like the idea of reporting the actual wait time better, +1 > provided > that we verify that doing so isn't too expensive. I think it probably > isn't, because in a long-running VACUUM there is likely to be disk > I/O, so the CPU overhead of a few extra gettimeofday() calls should be > fairly low by comparison. Agree. > I wonder if there's a noticeable hit when > everything is in-memory. I guess probably not, because with any sort > of normal configuration, we shouldn't be delaying after every block we > process, so the cost of those gettimeofday() calls should still be > getting spread across quite a bit of real work. I did some testing, with: shared_buffers = 12GB vacuum_cost_delay = 1 autovacuum_vacuum_cost_delay = 1 max_parallel_maintenance_workers = 0 max_parallel_workers = 0 added to a default config file. 
A table and all its indexes were fully in memory; the numbers are:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY n.nspname, c.relname
ORDER BY 3 DESC
LIMIT 11;

 nspname |      relname      | buffers
---------+-------------------+---------
 public  | large_tbl         |      80
 public  | large_tbl_pkey    |    5486
 public  | large_tbl_filler7 |    1859
 public  | large_tbl_filler4 |    1859
 public  | large_tbl_filler1 |    1859
 public  | large_tbl_filler6 |    1859
 public  | large_tbl_filler3 |    1859
 public  | large_tbl_filler2 |    1859
 public  | large_tbl_filler5 |    1859
 public  | large_tbl_filler8 |    1859
 public  | large_tbl_version |    1576
(11 rows)

The observed timings when vacuuming this table are:

On master:

vacuum phase: cumulative duration
-
scanning heap:     00:00:37.808184
vacuuming indexes: 00:00:41.808176
vacuuming heap:    00:00:54.808156

On master patched with actual time delayed:

vacuum phase: cumulative duration
-
scanning heap:     00:00:36.502104 (time_delayed: 22202)
vacuuming indexes: 00:00:41.002103 (time_delayed: 23769)
vacuuming heap:    00:00:54.302096 (time_delayed: 34886)

As we can see there is no noticeable degradation, while the vacuum entered
this instrumentation code path about 34886 times (cost_delay was set to 1).

> That said, I'm not sure this experiment shows a real problem with the
> idea of showing intended wait time. It does establish the concept that
> repeated signals can throw our numbers off, but 30ms isn't much of a
> discrepancy.

Yeah, the idea was just to show how easy it is to create a 30ms discrepancy.

> I'm worried about being off by a factor of two, or an
> order of magnitude. I think we still don't know if that can happen,
> but if we're going to show actual wait time anyway, then we don't need
> to explore the problems with other hypothetical systems too much.

Agree.
> I'm not convinced that reporting the number of waits is useful. If we > were going to report a possibly-inaccurate amount of actual waiting, > then also reporting the number of waits might make it easier to figure > out when the possibly-inaccurate number was in fact inaccurate. But I > think it's way better to report an accurate amount of actual waiting, > and then I'm not sure what we gain by also reporting the number of > waits. Sami shared his thoughts in [1] and [2] and so did I in [3]. If some of us still don't think that reporting the number of waits is useful then we can probably start without it. [1]: https://www.postgresql.org/message-id/0EA474B6-BF88-49AE-82CA-C1A9A3C17727%40amazon.com [2]: https://www.postgresql.org/message-id/E12435E2-5FCA-49B0-9ADB-0E7153F95E2D%40amazon.com [3]: https://www.postgresql.org/message-id/ZmmGG4e%2BqTBD2kfn%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi,

On Tue, Jun 11, 2024 at 02:48:30PM -0400, Robert Haas wrote:
> On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart
> wrote:
> > I'm struggling to think of a scenario in which the number of waits would be
> > useful, assuming you already know the amount of time spent waiting.

If we provide the actual time spent waiting, providing the number of waits
would allow one to see if there is a diff between the actual time and the
intended time (i.e. number of waits * cost_delay, should the cost_delay be
the same during the vacuum duration). That should trigger some thoughts if
the diff is large enough.

I think that what we are doing here is to somehow add instrumentation
around the "WAIT_EVENT_VACUUM_DELAY" wait event. If we were to add
instrumentation for wait events (generally speaking) we'd probably also
expose the number of waits per wait event (in addition to the time waited).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Tue, Jun 11, 2024 at 11:40:36AM -0500, Nathan Bossart wrote: > On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote: > > So I think that in v2 we could: 1) measure the actual wait time instead, 2) > > count the number of times the vacuum slept. We could also 3) reports the > > effective cost limit (as proposed by Nathan up-thread) (I think that 3) > > could > > be misleading but I'll yield to majority opinion if people think it's not). > > I still think the effective cost limit would be useful, if for no other > reason than to help reinforce that it is distributed among the autovacuum > workers. I also think it can be useful; my concern is more about putting this information in pg_stat_progress_vacuum. What about Sawada-san's proposal in [1]? (We could create a new view that would contain those data: per-worker dobalance, cost_limit, cost_delay, active, and failsafe.) [1]: https://www.postgresql.org/message-id/CAD21AoDOu%3DDZcC%2BPemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Allow logical failover slots to wait on synchronous replication
Hi, On Mon, Jun 10, 2024 at 09:25:10PM -0500, Nathan Bossart wrote: > On Mon, Jun 10, 2024 at 03:51:05PM -0700, John H wrote: > > The existing 'standby_slot_names' isn't great for users who are running > > clusters with quorum-based synchronous replicas. For instance, if > > the user has synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a > > bit tedious to have to reconfigure the standby_slot_names to set it to > > the most updated 3 sync replicas whenever different sync replicas start > > lagging. In the event that both GUCs are set, 'standby_slot_names' takes > > precedence. > > Hm. IIUC you'd essentially need to set standby_slot_names to "A,B,C,D,E" > to get the desired behavior today. That might ordinarily be okay, but it > could cause logical replication to be held back unnecessarily if one of the > replicas falls behind for whatever reason. A way to tie standby_slot_names > to synchronous replication instead does seem like it would be useful in > this case. FWIW, I have the same understanding and also think your proposal would be useful in this case. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote: > I'm always suspicious of this sort of thing. I tend to find nothing > gives me the right answer unless I assume that the actual sleep times > are randomly and systematically different from the intended sleep > times but arbitrarily large amounts. I think we should at least do > some testing: if we measure both the intended sleep time and the > actual sleep time, how close are they? Does it change if the system is > under crushing load (which might elongate sleeps) or if we spam > SIGUSR1 against the vacuum process (which might shorten them)? Though I (now) think that it would make sense to record the actual delay time instead (see [1]), I think it's interesting to do some testing as you suggested. With record_actual_time.txt (attached) applied on top of v1, we can see the intended and actual wait time. On my system, with "no load at all" except the vacuum running, I see no diff:

Tue Jun 11 09:22:06 2024 (every 1s)
  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        41107 |               41107 | 00:00:42.301851
(1 row)

Tue Jun 11 09:22:07 2024 (every 1s)
  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        42076 |               42076 | 00:00:43.301848
(1 row)

Tue Jun 11 09:22:08 2024 (every 1s)
  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        43045 |               43045 | 00:00:44.301854
(1 row)

But if I launch pg_reload_conf() 10 times in a row, I can see:

Tue Jun 11 09:22:09 2024 (every 1s)
  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        44064 |               44034 | 00:00:45.302965
(1 row)

Tue Jun 11 09:22:10 2024 (every 1s)
  pid  | relid |     phase     | time_delayed | actual_time_delayed |    duration
-------+-------+---------------+--------------+---------------------+-----------------
 54754 | 16385 | scanning heap |        45033 |               45003 | 00:00:46.301858

As we can see the
actual wait time is 30ms less than the intended wait time with this simple test. So I still think we should go with 1) actual wait time and 2) report the number of waits (as mentioned in [1]). Does that make sense to you? [1]: https://www.postgresql.org/message-id/Zmf712A5xcOM9Hlg%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 1345e99dcb..e4ba8de00a 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count, S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes, S.param8 AS indexes_total, S.param9 AS indexes_processed, -S.param10 AS time_delayed +S.param10 AS time_delayed, S.param11 AS actual_time_delayed FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 2551408a86..bbb5002efe 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -2381,18 +2381,29 @@ vacuum_delay_point(void) /* Nap if appropriate */ if (msec > 0) { + instr_time delay_start; + instr_time delay_time; + if (msec > vacuum_cost_delay * 4) msec = vacuum_cost_delay * 4; pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); + INSTR_TIME_SET_CURRENT(delay_start); pg_usleep(msec * 1000); + INSTR_TIME_SET_CURRENT(delay_time); pgstat_report_wait_end(); /* Report the amount of time we slept */ + INSTR_TIME_SUBTRACT(delay_time, delay_start); if (VacuumSharedCostBalance != NULL) + { pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); + pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_ACTUAL_TIME_DELA
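For reference, the "pg_reload_conf() 10 times in a row" part of the test above can be reproduced with a one-liner; each reload makes the postmaster signal its children with SIGHUP, which can cut the cost-delay naps short:

```sql
-- Trigger ten configuration reloads in a row while the vacuum is running.
SELECT pg_reload_conf() FROM generate_series(1, 10);
```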
Re: Track the amount of time waiting due to cost_delay
Hi, On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote: > Thank you for the proposal and the patch. I understand the motivation > of this patch. Thanks for looking at it! > Beside the point Nathan mentioned, I'm slightly worried > that massive parallel messages could be sent to the leader process > when the cost_limit value is low. I see, I can/will do some testing in this area and share the numbers. > > FWIW when I want to confirm the vacuum delay effect, I often use the > information from the DEBUG2 log message in VacuumUpdateCosts() > function. Exposing these data (per-worker dobalance, cost_lmit, > cost_delay, active, and failsafe) somewhere in a view might also be > helpful for users for checking vacuum delay effects. Do you mean adding time_delayed to pg_stat_progress_vacuum, and the cost_limit plus the other data you mentioned above in another dedicated view? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote: > On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart > wrote: > > Hm. Should we measure the actual time spent sleeping, or is a rough > > estimate good enough? I believe pg_usleep() might return early (e.g., if > > the process is signaled) or late, so this field could end up being > > inaccurate, although probably not by much. If we're okay with millisecond > > granularity, my first instinct is that what you've proposed is fine, but I > > figured I'd bring it up anyway. > > I bet you could also sleep for longer than planned, throwing the > numbers off in the other direction. Thanks for looking at it! Agree, that's how I read "or late" from Nathan's comment above. > I'm always suspicious of this sort of thing. I tend to find nothing > gives me the right answer unless I assume that the actual sleep times > are randomly and systematically different from the intended sleep > times but arbitrarily large amounts. I think we should at least do > some testing: if we measure both the intended sleep time and the > actual sleep time, how close are they? Does it change if the system is > under crushing load (which might elongate sleeps) or if we spam > SIGUSR1 against the vacuum process (which might shorten them)? OTOH Sami proposed in [1] to count the number of times the vacuum went into delay. I think that's a good idea. His idea makes me think that, in addition to the number of waits, it would make sense to measure the "actual" sleep time (and not the intended one), so that one could measure the difference between the intended wait time (number of waits * cost_delay, if it does not change during the vacuum duration) and the actual measured wait time. So I think that in v2 we could: 1) measure the actual wait time instead, 2) count the number of times the vacuum slept.
We could also 3) report the effective cost limit (as proposed by Nathan up-thread); I think that 3) could be misleading, but I'll yield to the majority opinion if people think it's not. Thoughts? [1]: https://www.postgresql.org/message-id/A0935130-7C4B-4094-B6E4-C7D5086D50EF%40amazon.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Mon, Jun 10, 2024 at 08:12:46PM +0000, Imseih (AWS), Sami wrote: > >> This sounds like useful information to me. > > > Thanks for looking at it! > > The VacuumDelay is the only visibility available to > gauge the cost_delay. Having this information > advertised by pg_stat_progress_vacuum as is being proposed > is much better. Thanks for looking at it! > However, I also think that the > "number of times" the vacuum went into delay will be needed > as well. Both values will be useful to tune cost_delay and cost_limit. Yeah, I think that's a good idea. With v1 one could figure out how many times the delay has been triggered, but that does not work anymore if: 1) cost_delay changed during the vacuum duration or 2) the patch changes the way time_delayed is measured/reported (meaning we'd get the actual wait time and not the theoretical time as v1 does). > > It may also make sense to accumulate the total_time in delay > and the number of times delayed in a cumulative statistics [0] > view to allow a user to trend this information overtime. > I don't think this info fits in any of the existing views, i.e. > pg_stat_database, so maybe a new view for cumulative > vacuum stats may be needed. This is likely a separate > discussion, but calling it out here. +1 > >> IIUC you'd need to get information from both pg_stat_progress_vacuum and > >> pg_stat_activity in order to know what percentage of time was being spent > >> in cost delay. Is that how you'd expect for this to be used in practice? > > > Yeah, one could use a query such as: > > > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p > > JOIN pg_stat_activity a using (pid) > > Maybe all progress views should just expose the > "beentry->st_activity_start_timestamp " > to let the user know when the current operation began. Yeah, maybe. I think this is likely a separate discussion too, thoughts?
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote: > On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote: > > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote: > >> I wonder if we should also > >> surface the effective cost limit for each autovacuum worker. > > > > I'm not sure about it as I think that it could be misleading: one could > > query > > pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is > > due to _this_ cost_limit. But that's not necessarily true as the cost_limit > > could > > have changed multiple times since the vacuum started. So, unless there is > > frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed > > and > > the cost_limit could be misleading IMHO. > > Well, that's true for the delay, too, right (at least as of commit > 7d71d3d)? Yeah right, but the patch exposes the total amount of time the vacuum has been delayed (not the cost_delay per se), which does not sound misleading to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote: > On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote: > > During the last pgconf.dev I attended Robert's presentation about > > autovacuum and > > it reminded me of an idea I had some time ago: $SUBJECT > > This sounds like useful information to me. Thanks for looking at it! > I wonder if we should also > surface the effective cost limit for each autovacuum worker. I'm not sure about it as I think that it could be misleading: one could query pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is due to _this_ cost_limit. But that's not necessarily true, as the cost_limit could have changed multiple times since the vacuum started. So, unless there is frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and the cost_limit could be misleading IMHO. > > Currently one can change [autovacuum_]vacuum_cost_delay and > > [autovacuum_]vacuum_cost_limit but has no reliable way to measure the > > impact of > > the changes on the vacuum duration: one could observe the vacuum duration > > variation but the correlation to the changes is not accurate (as many other > > factors could impact the vacuum duration (load on the system, i/o > > latency,...)). > > IIUC you'd need to get information from both pg_stat_progress_vacuum and > pg_stat_activity in order to know what percentage of time was being spent > in cost delay. Is that how you'd expect for this to be used in practice? Yeah, one could use a query such as: select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid) for example. Would it be worth providing an example somewhere in the docs?
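As a possible doc example (a sketch only, assuming the time_delayed column from the patch, reported in milliseconds), the share of the vacuum's runtime spent sleeping could be derived like this:

```sql
-- time_delayed is the new column proposed in this thread (milliseconds);
-- xact_start comes from pg_stat_activity.
SELECT p.pid,
       p.relid::regclass AS relation,
       p.time_delayed,
       now() - a.xact_start AS duration,
       round((100.0 * p.time_delayed /
              (extract(epoch FROM now() - a.xact_start) * 1000))::numeric, 2)
         AS pct_time_delayed
FROM pg_stat_progress_vacuum AS p
JOIN pg_stat_activity AS a USING (pid);
```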
> > pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); > > pg_usleep(msec * 1000); > > pgstat_report_wait_end(); > > + /* Report the amount of time we slept */ > > + if (VacuumSharedCostBalance != NULL) > > + > > pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); > > + else > > + > > pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); > > Hm. Should we measure the actual time spent sleeping, or is a rough > estimate good enough? I believe pg_usleep() might return early (e.g., if > the process is signaled) or late, so this field could end up being > inaccurate, although probably not by much. If we're okay with millisecond > granularity, my first instinct is that what you've proposed is fine, but I > figured I'd bring it up anyway. Thanks for bringing that up! I had the same thought when writing the code and came to the same conclusion. I think that's a good enough estimate, especially during a long-running vacuum (which is probably the case where users care the most). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: relfilenode statistics
Hi, On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote: > On Thu, Jun 6, 2024 at 11:17 PM Andres Freund wrote: > > If we just want to keep prior stats upon a relation rewrite, we can just copy > > the stats from the old relfilenode. Or we can decide that those stats don't > > really make sense anymore, and start from scratch. > > I think we need to think carefully about what we want the user > experience to be here. "Per-relfilenode stats" could mean "sometimes I > don't know the relation OID so I want to use the relfilenumber > instead, without changing the user experience" or it could mean "some > of these stats actually properly pertain to the relfilenode rather > than the relation so I want to associate them with the right object > and that will affect how the user sees things." We need to decide > which it is. If it's the former, then we need to examine whether the > goal of hiding the distinction between relfilenode stats and relation > stats from the user is in fact feasible. If it's the latter, then we > need to make sure the whole patch reflects that design, which would > include e.g. NOT copying stats from the old to the new relfilenode, > and which would also include documenting the behavior in a way that > will be understandable to users. Thanks for sharing your thoughts! Let's take the current heap_blks_read as an example: it currently survives a relation rewrite and I guess we don't want to change the existing user experience for it. Now say we want to add "heap_blks_written" (like in this POC patch): then I think that it makes sense for the user to 1) query this new stat from the same place as the existing heap_blks_read (from pg_statio_all_tables) and 2) have the same experience as far as the relation rewrite is concerned (keep the previous stats).
To achieve the rewrite behavior we could: 1) copy the stats from the OLD relfilenode to the relation (like in the POC patch), or 2) copy the stats from the OLD relfilenode to the NEW one (could be in a dedicated field). The pros of 1) are that the behavior is consistent with the current heap_blks_read and that the user can still see the current relfilenode stats (through a new API) if he wants to. > In my experience, the worst thing you can do in cases like this is be > somewhere in the middle. Then you tend to end up with stuff like: the > difference isn't supposed to be something that the user knows or cares > about, except that they do have to know and care because you haven't > thoroughly covered up the deception, and often they have to reverse > engineer the behavior because you didn't document what was really > happening because you imagined that they wouldn't notice. My idea was to move all that is in pg_statio_all_tables to relfilenode stats and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure the user can still retrieve the stats from pg_statio_all_tables in such a way that it survives a rewrite, 3) provide dedicated APIs to retrieve relfilenode stats but only for the current relfilenode, and 4) document this behavior. This is what the POC patch is doing for heap_blks_written (we would need to do the same for heap_blks_read and friends) except for the documentation part. What do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Proposal to add page headers to SLRU pages
Hi, On Tue, Mar 19, 2024 at 06:48:33AM +0000, Li, Yong wrote: > > Unfortunately, the test requires a setup of two different versions of PG. I am not > aware of an existing test infrastructure which can run automated tests using two > PGs. I did manually verify the output of pg_upgrade. I think there is something in t/002_pg_upgrade.pl (see src/bin/pg_upgrade/TESTING) that could be used to run automated tests using an old and a current version. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi, On Mon, Jun 10, 2024 at 03:39:34PM +0900, Michael Paquier wrote: > On Mon, Jun 10, 2024 at 06:29:17AM +0000, Bertrand Drouvot wrote: > > Thanks for the report! I think it makes sense to add it to the list of known > > failures. > > > > One way to deal with those corner cases could be to make use of injection > > points > > around places where RUNNING_XACTS is emitted, thoughts? > > Ah. You mean to force a wait in the code path generating the standby > snapshots for the sake of this test? Yeah. > That's interesting, I like it. Great, will look at it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi Alexander, On Sat, Jun 08, 2024 at 07:00:00AM +0300, Alexander Lakhin wrote: > Hello Bertrand and Michael, > > 23.01.2024 11:07, Bertrand Drouvot wrote: > > On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote: > > > > > Anyway, that's not the end of it. What should we do for snapshot > > > snapshot records coming from the bgwriter? > > What about? > > > > 3) depending on how stabilized this test (and others that suffer from > > "random" > > xl_running_xacts) is, then think about the bgwriter. > > A recent buildfarm failure [1] reminds me of that remaining question. > > It's hard to determine from this info, why row_removal_activeslot was not > invalidated, but running this test on a slowed down Windows VM, I (still) > get the same looking failures caused by RUNNING_XACTS appeared just before > `invalidating obsolete replication slot "row_removal_inactiveslot"`. > So I would consider this failure as yet another result of bgwriter activity > and add it to the list of known failures as such... Thanks for the report! I think it makes sense to add it to the list of known failures. One way to deal with those corner cases could be to make use of injection points around places where RUNNING_XACTS is emitted, thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Track the amount of time waiting due to cost_delay
Hi hackers, During the last pgconf.dev I attended Robert's presentation about autovacuum [1] and it reminded me of an idea I had some time ago: $SUBJECT Please find attached a patch doing so by adding a new field (aka "time_delayed") to the pg_stat_progress_vacuum view. Currently one can change [autovacuum_]vacuum_cost_delay and [autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of the changes on the vacuum duration: one could observe the vacuum duration variation, but the correlation to the changes is not accurate (as many other factors could impact the vacuum duration: load on the system, I/O latency, ...). This new field reports the time that the vacuum has to sleep due to cost delay: it could be useful to 1) measure the impact of the current cost_delay and cost_limit settings and 2) experiment with new values (and thus help with decision making for those parameters). The patch is relatively small thanks to the work that has been done in f1889729dd (to allow parallel workers to report to the leader). [1]: https://www.pgevents.ca/events/pgconfdev2024/schedule/session/29-how-autovacuum-goes-wrong-and-can-we-please-make-it-stop-doing-that/ Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 750dfc26cd6fcf5a5618c3fe35fc42d5b5c66f00 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Thu, 6 Jun 2024 12:35:57 + Subject: [PATCH v1] Report the total amount of time that vacuum has been delayed due to cost delay This commit adds one column: time_delayed to the pg_stat_progress_vacuum system view to show the total amount of time in milliseconds that vacuum has been delayed. This uses the new parallel message type for progress reporting added by f1889729dd. Bump catversion because this changes the definition of pg_stat_progress_vacuum.
--- doc/src/sgml/monitoring.sgml | 11 +++ src/backend/catalog/system_views.sql | 3 ++- src/backend/commands/vacuum.c| 6 ++ src/include/catalog/catversion.h | 2 +- src/include/commands/progress.h | 1 + src/test/regress/expected/rules.out | 3 ++- 6 files changed, 23 insertions(+), 3 deletions(-) 47.6% doc/src/sgml/ 3.7% src/backend/catalog/ 26.8% src/backend/commands/ 7.5% src/include/catalog/ 4.1% src/include/commands/ 10.0% src/test/regress/expected/ diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 053da8d6e4..cdd0f0e533 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -6290,6 +6290,17 @@ FROM pg_stat_get_backend_idset() AS backendid; cleaning up indexes. + + + + time_delayed bigint + + + Total amount of time spent in milliseconds waiting due to vacuum_cost_delay + or autovacuum_vacuum_cost_delay. In case of parallel + vacuum the reported time is across all the workers and the leader. + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 53047cab5f..1345e99dcb 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned, S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count, S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes, -S.param8 AS indexes_total, S.param9 AS indexes_processed +S.param8 AS indexes_total, S.param9 AS indexes_processed, +S.param10 AS time_delayed FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 48f8eab202..2551408a86 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -40,6 +40,7 @@ #include "catalog/pg_inherits.h" #include "commands/cluster.h" #include "commands/defrem.h" +#include "commands/progress.h" #include 
"commands/vacuum.h" #include "miscadmin.h" #include "nodes/makefuncs.h" @@ -2386,6 +2387,11 @@ vacuum_delay_point(void) pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); pg_usleep(msec * 1000); pgstat_report_wait_end(); + /* Report the amount of time we slept */ + if (VacuumSharedCostBalance != NULL) + pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); + else + pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec); /* * We don't want to ignore postmaster death during very long vacuums diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h index f0809c0e58..40b4f1d1e4 100644 --- a/src/include/catalog/catversion.h +++ b/src/include/catalog/catversion.h @@ -57,6 +5
Re: relfilenode statistics
Hi, On Thu, Jun 06, 2024 at 08:17:36PM -0700, Andres Freund wrote: > Hi, > > On 2024-06-06 12:27:49 -0400, Robert Haas wrote: > > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot > > wrote: > > > I think we should keep the stats in the relation during relfilenode > > > changes. > > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes > > > in > > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you > > > can > > > see in the example provided up-thread the new heap_blks_written statistic > > > has > > > been preserved during the TRUNCATE. > > > > Yeah, I think there's something weird about this design. Somehow we're > > ending up with both per-relation and per-relfilenode counters: > > > > + pg_stat_get_blocks_written(C.oid) + > > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN > > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, > > C.relfilenode) AS heap_blks_written, > > > > I'll defer to Andres if he thinks that's awesome, but to me it does > > not seem right to track some blocks written in a per-relation counter > > and others in a per-relfilenode counter. > > It doesn't immediately sound awesome. Nor really necessary? > > If we just want to keep prior stats upon a relation rewrite, we can just copy > the stats from the old relfilenode. Agree, that's another option. But I think that would need to be in another field, like "cumulative_XXX", to ensure one could still retrieve stats that are "dedicated" to this particular "new" relfilenode. Thoughts? > Or we can decide that those stats don't > really make sense anymore, and start from scratch. > > > I *guess* I could see an occasional benefit in having both counter for "prior > relfilenodes" and "current relfilenode" - except that stats get reset manually > and upon crash anyway, making this less useful than if it were really > "lifetime" stats. Right, but currently they are not lost during a relation rewrite.
If we decide to not keep the relfilenode stats during a rewrite then things like heap_blks_read would stop surviving a rewrite (if we move it to relfilenode stats) while it currently does. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: relfilenode statistics
Hi, On Thu, Jun 06, 2024 at 08:38:06PM -0700, Andres Freund wrote: > Hi, > > On 2024-06-03 11:11:46 +0000, Bertrand Drouvot wrote: > > The main argument is that we currently don’t have writes counters for > > relations. > > The reason is that we don’t have the relation OID when writing buffers out. > > Tracking writes per relfilenode would allow us to track/consolidate writes > > per > > relation (example in the v1 patch and in the message up-thread). > > > > I think that adding instrumentation in this area (writes counters) could be > > beneficial (like it is for the ones we currently have for reads). > > > > Second argument is that this is also beneficial for the "Split index and > > table statistics into different types of stats" thread (mentioned in the > > previous > > message). It would allow us to avoid additional branches in some situations > > (like > > the one mentioned by Andres in the link I provided up-thread). > > I think there's another *very* significant benefit: > > Right now physical replication doesn't populate statistics fields like > n_dead_tup, which can be a huge issue after failovers, because there's little > information about what autovacuum needs to do. > > Auto-analyze *partially* can fix it at times, if it's lucky enough to see > enough dead tuples - but that's not a given and even if it works, is often > wildly inaccurate. > > > Once we put things like n_dead_tup into per-relfilenode stats, Hm, I had in mind to populate relfilenode stats only with stats that are somehow related to I/O activity. Which ones do you have in mind to put in relfilenode stats? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Thu, Jun 06, 2024 at 04:00:23PM -0400, Robert Haas wrote: > On Thu, Jun 6, 2024 at 1:56 AM Bertrand Drouvot > wrote: > > v9 is more invasive (as it changes code in much more places) than v8 but it > > is > > easier to follow (as it is now clear where the new lock is acquired). > > Hmm, this definitely isn't what I had in mind. Possibly that's a sign > that what I had in mind was dumb, but for sure it's not what I > imagined. What I thought you were going to do was add calls like > LockDatabaseObject(NamespaceRelationId, schemaid, 0, AccessShareLock) > in various places, or perhaps LockRelationOid(reloid, > AccessShareLock), or whatever the case may be. I see what you’re saying, doing things like: LockDatabaseObject(TypeRelationId, returnType, 0, AccessShareLock); in ProcedureCreate() for example. > Here you've got stuff > like this: > > - record_object_address_dependencies(, addrs_auto, > -DEPENDENCY_AUTO); > + lock_record_object_address_dependencies(, addrs_auto, > + DEPENDENCY_AUTO); > > ...which to me looks like the locking is still pushed down inside the > dependency code. 
Yes, but it’s now located in places where, I think, it’s easier to understand what’s going on (as compared to v8), except maybe for:

recordDependencyOnExpr()
makeOperatorDependencies()
GenerateTypeDependencies()
makeParserDependencies()
makeDictionaryDependencies()
makeTSTemplateDependencies()
makeConfigurationDependencies()

but probably fine for:

heap_create_with_catalog()
StorePartitionKey()
index_create()
AggregateCreate()
CastCreate()
CreateConstraintEntry()
ProcedureCreate()
RangeCreate()
InsertExtensionTuple()
CreateTransform()
CreateProceduralLanguage()

The reasons I keep it linked to the dependency code are:

- To ensure we don’t miss anything (well, with the new Assert in place that’s probably a tangential argument)
- It’s not only about locking the object: it’s also about 1) verifying that the object is pinned, 2) checking that it still exists and 3) providing a description in the error message if we can (in case the object does not exist anymore). Relying on an already built object (in the dependency code) avoids having to 1) define the object(s) one more time or 2) create new functions that would do the same as isObjectPinned() and getObjectDescription() with a different set of arguments.

That may sound like weak arguments but it has been my reasoning. Do you still find the code hard to maintain with v9?

> > And you also have stuff like this: > > ObjectAddressSet(referenced, RelationRelationId, childTableId); + depLockAndCheckObject(); > recordDependencyOn(, , DEPENDENCY_PARTITION_SEC); > > But in depLockAndCheckObject you have: > > + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; > > That doesn't seem right, because then it seems like the call isn't > doing anything, but there isn't really any reason for it to not be > doing anything. If we're dropping a dependency on a table, then it > seems like we need to have a lock on that table.
Presumably the reason > why we don't end up with dangling dependencies in such cases now is > because we're careful about doing LockRelation() in the right places, Yeah, that's what I think: we're already careful when we deal with relations. > but we're not similarly careful about other operations e.g. > ConstraintSetParentConstraint is called by DefineIndex which calls > table_open(childRelId, ...) first, but there's no logic in DefineIndex > to lock the constraint. table_open(childRelId, ...) would lock any "ALTER TABLE DROP CONSTRAINT" already. Not sure I understand your concern here. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 06, 2024 at 05:59:00PM +0530, Ashutosh Sharma wrote: > Hello everyone, > > At present, we use MVCC snapshots to identify dependent objects. This > implies that if a new dependent object is inserted within a transaction > that is still ongoing, our search for dependent objects won't include this > recently added one. Consequently, if someone attempts to drop the > referenced object, it will be dropped, and when the ongoing transaction > completes, we will end up having an entry for a referenced object that has > already been dropped. This situation can lead to an inconsistent state. > Below is an example illustrating this scenario: > > Session 1: > - create table t1(a int); > - insert into t1 select i from generate_series(1, 1000) i; > - create extension btree_gist; > - create index i1 on t1 using gist( a ); > > Session 2: (While the index creation in session 1 is in progress, drop the > btree_gist extension) > - drop extension btree_gist; > > Above command succeeds and so does the create index command running in > session 1, post this, if we try running anything on table t1, i1, it fails > with an error: "cache lookup failed for opclass ..." > > Attached is the patch that I have tried, which seems to be working for me. > It's not polished and thoroughly tested, but just sharing here to clarify > what I am trying to suggest. Please have a look and let me know your > thoughts.

Thanks for the patch proposal!

The patch does not fix the other way around:

- session 1: BEGIN; DROP extension btree_gist;
- session 2: create index i1 on t1 using gist( a );
- session 1: commits while session 2 is creating the index

It also does not address all the possible orphaned dependency cases. There is an ongoing thread (see [1]) to fix the orphaned dependencies issue. v9 attached in [1] fixes the case you describe here.
[1]: https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: relfilenode statistics
Hi, On Mon, Jun 03, 2024 at 11:11:46AM +, Bertrand Drouvot wrote: > Yeah, I’ll update the commit message in V2 with better explanations once I get > feedback on V1 (should we decide to move on with the relfilenode stats idea). Please find attached v2, a mandatory rebase due to cd312adc56. In passing it provides a more detailed commit message (also making clear that the goal of this patch is to start the discussion and agree on the design before moving forward). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 81d25e077c9f4eafa5304c257d1b39ee8a811628 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Thu, 16 Nov 2023 02:30:01 + Subject: [PATCH v2] Provide relfilenode statistics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit We currently don’t have write counters for relations. The reason is that we don’t have the relation OID when writing buffers out. Tracking writes per relfilenode would allow us to track/consolidate writes per relation. relfilenode stats are also beneficial for the "Split index and table statistics into different types of stats" work in progress: it would allow us to avoid additional branches in some situations. === Remarks === This is a POC patch. There is still work to do: there are more places where we should add relfilenode counters and more APIs to create to retrieve the relfilenode stats; the patch takes care of the rewrite generated by TRUNCATE but there is more to handle, like CLUSTER and VACUUM FULL. The new logic to retrieve stats in pg_statio_all_tables has been implemented only for the new blocks_written stat (we'd need to do the same for the existing buffer read / buffer hit counters if we agree on the approach implemented here). The goal of this patch is to start the discussion and agree on the design before moving forward.
--- src/backend/access/rmgrdesc/xactdesc.c| 5 +- src/backend/catalog/storage.c | 8 ++ src/backend/catalog/system_functions.sql | 2 +- src/backend/catalog/system_views.sql | 5 +- src/backend/postmaster/checkpointer.c | 5 + src/backend/storage/buffer/bufmgr.c | 6 +- src/backend/storage/smgr/md.c | 7 ++ src/backend/utils/activity/pgstat.c | 39 -- src/backend/utils/activity/pgstat_database.c | 12 +- src/backend/utils/activity/pgstat_function.c | 13 +- src/backend/utils/activity/pgstat_relation.c | 112 -- src/backend/utils/activity/pgstat_replslot.c | 13 +- src/backend/utils/activity/pgstat_shmem.c | 19 ++- .../utils/activity/pgstat_subscription.c | 12 +- src/backend/utils/activity/pgstat_xact.c | 60 +++--- src/backend/utils/adt/pgstatfuncs.c | 34 +- src/include/access/tableam.h | 19 +++ src/include/access/xact.h | 1 + src/include/catalog/pg_proc.dat | 14 ++- src/include/pgstat.h | 19 ++- src/include/utils/pgstat_internal.h | 34 -- src/test/recovery/t/029_stats_restart.pl | 40 +++ .../recovery/t/030_stats_cleanup_replica.pl | 6 +- src/test/regress/expected/rules.out | 12 +- src/test/regress/expected/stats.out | 30 ++--- src/test/regress/sql/stats.sql| 30 ++--- src/test/subscription/t/026_stats.pl | 4 +- src/tools/pgindent/typedefs.list | 1 + 28 files changed, 415 insertions(+), 147 deletions(-) 4.6% src/backend/catalog/ 47.8% src/backend/utils/activity/ 6.5% src/backend/utils/adt/ 3.7% src/backend/ 3.3% src/include/access/ 3.3% src/include/catalog/ 6.2% src/include/utils/ 3.3% src/include/ 12.1% src/test/recovery/t/ 5.5% src/test/regress/expected/ 3.0% src/test/ diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c index dccca201e0..c02b079645 100644 --- a/src/backend/access/rmgrdesc/xactdesc.c +++ b/src/backend/access/rmgrdesc/xactdesc.c @@ -319,10 +319,11 @@ xact_desc_stats(StringInfo buf, const char *label, appendStringInfo(buf, "; %sdropped stats:", label); for (i = 0; i < ndropped; i++) { - appendStringInfo(buf, " %d/%u/%u", + 
appendStringInfo(buf, " %d/%u/%u/%u", dropped_stats[i].kind, dropped_stats[i].dboid, - dropped_stats[i].objoid); + dropped_stats[i].objoid, + dropped_stats[i].relfile); } } } diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index f56b3cc0f2..db6107cd90 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -33,6 +33,7 @@ #include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" +#include "utils/pgstat_internal.h" #include "utils/rel.h" /
Re: Avoid orphaned objects dependencies, take 3
Hi, On Thu, May 23, 2024 at 02:10:54PM -0400, Robert Haas wrote: > On Thu, May 23, 2024 at 12:19 AM Bertrand Drouvot > wrote: > > The reason why we are using a dirty snapshot here is for the cases where we > > are > > recording a dependency on a referenced object that we are creating at the > > same > > time behind the scene (for example, creating a composite type while creating > > a relation). Without the dirty snapshot, then the object we are creating > > behind > > the scene (the composite type) would not be visible and we would wrongly > > assume > > that it has been dropped. > > The usual reason for using a dirty snapshot is that you want to see > uncommitted work by other transactions. It sounds like you're saying > you just need to see uncommitted work by the same transaction. If > that's true, I think using HeapTupleSatisfiesSelf would be clearer. Or > maybe we just need to put CommandCounterIncrement() calls in the right > places to avoid having the problem in the first place. Or maybe this > is another sign that we're doing the work at the wrong level.

Thanks for having discussed your concern with Tom last week during pgconf.dev and for sharing the outcome with me. I understand your concern regarding code maintainability with the current approach. Please find attached v9 that:

- Acquires the lock and checks for object existence at an upper level, meaning before calling recordDependencyOn() and recordMultipleDependencies().
- Gets rid of the SNAPSHOT_SELF snapshot usage and relies on CommandCounterIncrement() instead to ensure new entries are visible when we check for object existence (for the cases where we create additional objects behind the scenes, like a composite type while creating a relation).
- Adds an assertion in recordMultipleDependencies() to ensure that we locked the object before recording the dependency (to ensure we don't miss any cases now that the lock is acquired at an upper level).
A few remarks:

My first attempt was to move eliminate_duplicate_dependencies() out of record_object_address_dependencies() so that we get the calls in this order:

eliminate_duplicate_dependencies()
depLockAndCheckObjects()
record_object_address_dependencies()

What I'm doing instead in v9 is renaming record_object_address_dependencies() to lock_record_object_address_dependencies() and adding depLockAndCheckObjects() in it at the right place. That way the caller of [lock_]record_object_address_dependencies() is not responsible for calling eliminate_duplicate_dependencies() (which would have been the case with my first attempt).

We need to set up the LOCKTAG before calling the new Assert in recordMultipleDependencies(). So, we use "#ifdef USE_ASSERT_CHECKING" here to not set up the LOCKTAG on a non-assert-enabled build.

v9 is more invasive (as it changes code in many more places) than v8 but it is easier to follow (as it is now clear where the new lock is acquired).

Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 6b6ee40fab0b011e9858a0a25624935c59732bfd Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v9] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked.
Also, after the new lock attempt, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 commits (that would not be the case should session 1 abort the transaction). If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with fewer details). The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - alter a dependency (function and schema) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- src/backend/catalog/aclchk.c | 1 + src/backend/catalog/dependency.c | 83 +-- src/backend/catalog/heap.c| 7 +- src/b
Re: relfilenode statistics
On Tue, Jun 04, 2024 at 09:26:27AM -0400, Robert Haas wrote: > On Mon, Jun 3, 2024 at 7:11 AM Bertrand Drouvot > wrote: > > We’d move the current buffer read and buffer hit counters from the relation > > stats > > to the relfilenode stats (while still being able to retrieve them from the > > pg_statio_all_tables/indexes views: see the example for the new > > heap_blks_written > > stat added in the patch). Generally speaking, I think that tracking > > counters at > > a common level (i.e relfilenode level instead of table or index level) is > > beneficial (avoid storing/allocating space for the same counters in multiple > > structs) and sounds more intuitive to me. > > Hmm. So if I CLUSTER or VACUUM FULL the relation, the relfilenode > changes. Does that mean I lose all of those stats? Is that a problem? > Or is it good? Or what? I think we should keep the stats in the relation during relfilenode changes. As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can see in the example provided up-thread the new heap_blks_written statistic has been preserved during the TRUNCATE. Please note that the v1 POC only takes care of the new heap_blks_written stat and that the logic used in table_relation_set_new_filelocator() would probably need to be applied in rebuild_relation() or such to deal with CLUSTER or VACUUM FULL. For the relation, the new counter "blocks_written" has been added to the PgStat_StatTabEntry struct (it's not needed in the PgStat_TableCounts one as the relfilenode stat takes care of it). It's added in PgStat_StatTabEntry only to copy/preserve the relfilenode stats during rewrite operations and to retrieve the stats in pg_statio_all_tables. Then, if later we split the relation stats into index/table stats, we'd have blocks_written defined in fewer structs (as compared to doing the split without relfilenode stats in place).
As mentioned up-thread, the new logic has been implemented in v1 only for the new blocks_written stat (we'd need to do the same for the existing buffer read / buffer hit if we agree on the approach implemented in v1). > I also thought about the other direction. Suppose I drop the a > relation and create a new one that gets a different relation OID but > the same relfilenode. But I don't think that's a problem: dropping the > relation should forcibly remove the old stats, so there won't be any > conflict in this case. Yeah. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: relfilenode statistics
Hi Robert, On Mon, May 27, 2024 at 09:10:13AM -0400, Robert Haas wrote: > Hi Bertrand, > > It would be helpful to me if the reasons why we're splitting out > relfilenodestats could be more clearly spelled out. I see Andres's > comment in the thread to which you linked, but it's pretty vague about > why we should do this ("it's not nice") and whether we should do this > ("I wonder if this is an argument for") and maybe that's all fine if > Andres is going to be the one to review and commit this, but even if > then it would be nice if the rest of us could follow along from home, > and right now I can't. Thanks for the feedback! You’re completely right, my previous message was missing a clear explanation of why I think that relfilenode stats could be useful. Let me try to fix this. The main argument is that we currently don’t have write counters for relations. The reason is that we don’t have the relation OID when writing buffers out. Tracking writes per relfilenode would allow us to track/consolidate writes per relation (example in the v1 patch and in the message up-thread). I think that adding instrumentation in this area (write counters) could be beneficial (like it is for the ones we currently have for reads). The second argument is that this is also beneficial for the "Split index and table statistics into different types of stats" thread (mentioned in the previous message). It would allow us to avoid additional branches in some situations (like the one mentioned by Andres in the link I provided up-thread). If we agree that the main argument makes it worth thinking about having relfilenode stats, then I think using them as proposed in the second argument makes sense too: We’d move the current buffer read and buffer hit counters from the relation stats to the relfilenode stats (while still being able to retrieve them from the pg_statio_all_tables/indexes views: see the example for the new heap_blks_written stat added in the patch).
Generally speaking, I think that tracking counters at a common level (i.e. the relfilenode level instead of the table or index level) is beneficial (it avoids storing/allocating space for the same counters in multiple structs) and sounds more intuitive to me. Also, I think this opens the door to new ideas: for example, with relfilenode statistics in place, we could probably also start thinking about tracking checksum errors per relfilenode. > The commit message is often a good place to spell this kind of thing > out, because then it's included with every version of the patch you > post, and may be of some use to the eventual committer in writing > their commit message. The body of the email where you post the patch > set can be fine, too. > Yeah, I’ll update the commit message in V2 with better explanations once I get feedback on V1 (should we decide to move on with the relfilenode stats idea). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
relfilenode statistics
Hi hackers,

Please find attached a POC patch to implement $SUBJECT.

Adding relfilenode statistics has been proposed in [1]. The idea is to allow tracking dirtied blocks, written blocks,... on a per-relation basis.

The attached patch is not in a fully "polished" state yet: there are more places where we should add relfilenode counters, and more APIs to create to retrieve the relfilenode stats. But I think that it is in a state that can be used to discuss the approach it implements (so that we can agree on it or not) before moving forward.

The approach implemented in this patch is the following:

- A new PGSTAT_KIND_RELFILENODE stats kind is added
- A new attribute (aka relfile) has been added to PgStat_HashKey so that we can record (dboid, spcOid and relfile) to identify a relfilenode entry
- pgstat_create_transactional() is used in RelationCreateStorage()
- pgstat_drop_transactional() is used in RelationDropStorage()
- RelationPreserveStorage() will remove the entry from the list of dropped stats

The current approach to deal with table rewrites is to:

- copy the relfilenode stats in table_relation_set_new_filelocator() from the relfilenode stats entry to the shared table stats entry
- in the pg_statio_all_tables view: add the table stats entry (that contains the "previous" relfilenode stats (due to the above) that were linked to this relation) to the current relfilenode stats linked to the relation

An example is done in the attached patch for the new heap_blks_written field in pg_statio_all_tables.
Outcome is:

"
postgres=# create table bdt (a int);
CREATE TABLE
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                 0
(1 row)

postgres=# insert into bdt select generate_series(1,1);
INSERT 0 1
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                 0
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                45
(1 row)

postgres=# truncate table bdt;
TRUNCATE TABLE
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                45
(1 row)

postgres=# insert into bdt select generate_series(1,1);
INSERT 0 1
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                45
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
-------------------
                90
(1 row)
"

Some remarks:

- My first attempt was to call pgstat_create_transactional() and pgstat_drop_transactional() at the same places as it is done for the relations, but that did not work well (mainly due to corner cases during rewrites).
- Please don't pay attention to the pgstat_count_buffer_read() and pgstat_count_buffer_hit() calls in pgstat_report_relfilenode_buffer_read() and pgstat_report_relfilenode_buffer_hit(). Those stats will follow the same flow as the one done and explained above for the new heap_blks_written one (should we agree on it).

Looking forward to your comments and feedback.
Regards, [1]: https://www.postgresql.org/message-id/20231113204439.r4lmys73tessqmak%40awork3.anarazel.de -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From e102c9d15c08c638879ece26008faee58cf4a07e Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Thu, 16 Nov 2023 02:30:01 + Subject: [PATCH v1] Provide relfilenode statistics --- src/backend/access/rmgrdesc/xactdesc.c| 5 +- src/backend/catalog/storage.c | 8 ++ src/backend/catalog/system_functions.sql | 2 +- src/backend/catalog/system_views.sql | 5 +- src/backend/postmaster/checkpointer.c | 5 + src/backend/storage/buffer/bufmgr.c | 6 +- src/backend/storage/smgr/md.c | 7 ++ src/backend/utils/activity/pgstat.c | 39 -- src/backend/utils/activity/pgstat_database.c | 12 +- src/backend/utils/activity/pgstat_function.c | 13 +- src/backend/utils/activity/pgstat_relation.c | 112 -- src/backend/utils/activity/pgstat_replslot.c | 13 +- src/backend/utils/activity/pgstat_shmem.c | 19 ++- .../utils/activity/pgstat_subscription.c | 12 +- src/backend/utils/activity/pgstat_xact.c | 60 +++--- src/backend/utils/adt/pgstatfuncs.c | 34 +- src/include/access/tableam.h | 19 +++ src/include/access/xact.h | 1 + src/include/catalog/pg_proc.dat | 14 ++- src/include/pgstat.h | 19 ++- src/include/utils/
Re: Avoid orphaned objects dependencies, take 3
Hi, On Thu, May 23, 2024 at 02:10:54PM -0400, Robert Haas wrote: > On Thu, May 23, 2024 at 12:19 AM Bertrand Drouvot > wrote: > > The reason why we are using a dirty snapshot here is for the cases where we > > are > > recording a dependency on a referenced object that we are creating at the > > same > > time behind the scene (for example, creating a composite type while creating > > a relation). Without the dirty snapshot, then the object we are creating > > behind > > the scene (the composite type) would not be visible and we would wrongly > > assume > > that it has been dropped. > > The usual reason for using a dirty snapshot is that you want to see > uncommitted work by other transactions. It sounds like you're saying > you just need to see uncommitted work by the same transaction. Right. > If that's true, I think using HeapTupleSatisfiesSelf would be clearer. Oh thanks! I did not know about the SNAPSHOT_SELF snapshot type (I should have checked all the snapshot types first though) and that's exactly what is needed here. Please find attached v8 making use of SnapshotSelf instead of a dirty snapshot. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 2611fb874a476c1a8d665f97d973193beb70292a Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v8] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema.
To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. Also, after the new lock attempt, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 commits (that would not be the case should session 1 abort the transaction). If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with fewer details). The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - alter a dependency (function and schema) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- src/backend/catalog/dependency.c | 59 src/backend/catalog/objectaddress.c | 66 + src/backend/catalog/pg_depend.c | 12 ++ src/backend/utils/errcodes.txt| 1 + src/include/catalog/dependency.h | 1 + src/include/catalog/objectaddress.h | 1 + .../expected/test_dependencies_locks.out | 129 ++ src/test/isolation/isolation_schedule | 1 + .../specs/test_dependencies_locks.spec| 89 9 files changed, 359 insertions(+) 24.3% src/backend/catalog/ 45.3% src/test/isolation/expected/ 28.2% src/test/isolation/specs/ diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index d4b5b2ade1..4271ed7c5b 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -1517,6 +1517,65 @@ AcquireDeletionLock(const ObjectAddress *object, int flags) } } +/* + * depLockAndCheckObject + * + * Lock the object that we are about to record a dependency on. + * After it's locked, verify that it hasn't been dropped while we + * weren't looking. If the object has been dropped, this function + * does not return!
+ */ +void +depLockAndCheckObject(const ObjectAddress *object) +{ + char *object_description; + + /* + * Those don't rely on LockDatabaseObject() when being dropped (see + * AcquireDeletionLock()). Also it looks like they can not produce + * orphaned dependent objects when being dropped. + */ + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; + + object_description = getObjectDescription(object, true); + + /* assume we should lock the whole object not a sub-object */ + LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock); + + /* check if object still exists */ + if (!ObjectByIdExist(object, false)) + { + /* + * It might be possible that we are creating it (for example creating + * a compo
Re: PostgreSQL 17 Beta 1 release announcement draft
Hi, On Wed, May 22, 2024 at 07:01:54PM -0400, Jonathan S. Katz wrote: > Thanks. As such I made it: > > "which provides descriptions about wait events and can be combined with > `pg_stat_activity` to give more insight into why an active session is > waiting." > Thanks! Works for me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Wed, May 22, 2024 at 10:48:12AM -0400, Robert Haas wrote: > On Wed, May 22, 2024 at 6:21 AM Bertrand Drouvot > wrote: > > I started initially with [1] but it was focusing on function-schema only. > > Yeah, that's what I thought we would want to do. And then just extend > that to the other cases. > > > Then I proposed [2] making use of a dirty snapshot when recording the > > dependency. > > But this approach appeared to be "scary" and it was still failing to close > > some race conditions. > > The current patch still seems to be using dirty snapshots for some > reason, which struck me as a bit odd. My intuition is that if we're > relying on dirty snapshots to solve problems, we likely haven't solved > the problems correctly, which seems consistent with your statement > about "failing to close some race conditions". But I don't think I > understand the situation well enough to be sure just yet. The reason why we are using a dirty snapshot here is for the cases where we are recording a dependency on a referenced object that we are creating at the same time behind the scenes (for example, creating a composite type while creating a relation). Without the dirty snapshot, the object we are creating behind the scenes (the composite type) would not be visible and we would wrongly assume that it has been dropped. Note that the dirty snapshot is only used when the object is first reported as "non existing" by the new ObjectByIdExist() function. > > I think there is 2 cases here: > > > > First case: the "conflicting" lock mode is for one of our own lock then > > LockCheckConflicts() > > would report this as a NON conflict. > > > > Second case: the "conflicting" lock mode is NOT for one of our own lock > > then LockCheckConflicts() > > would report a conflict. But I've the feeling that the existing code would > > already lock those sessions. > > > > Was your concern about "First case" or "Second case" or both? > > The second case, I guess.
It's bad to take a weaker lock first and > then a stronger lock on the same object later, because it can lead to > deadlocks that would have been avoided if the stronger lock had been > taken at the outset. In the example I shared up-thread it would be the opposite: session 1 would take an AccessExclusiveLock on the object before taking an AccessShareLock during changeDependencyFor(). > Here, it seems like things would happen in the > other order: if we took two locks, we'd probably take the stronger > lock in the higher-level code and then the weaker lock in the > dependency code. Yeah, I agree. > That shouldn't break anything; it's just a bit > inefficient. Yeah, the second lock is useless in that case (like in the example up-thread). > My concern was really more about the maintainability of > the code. I fear that if we add code that takes heavyweight locks in > surprising places, we might later find the behavior difficult to > reason about. > I think I understand your concern about code maintainability but I'm not sure that adding locks while recording a dependency is that surprising. > Tom, what is your thought about that concern? +1 Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Tue, May 21, 2024 at 08:53:06AM -0400, Robert Haas wrote: > On Wed, May 15, 2024 at 6:31 AM Bertrand Drouvot > wrote: > > Please find attached v6 (only diff with v5 is moving the tests as suggested > > above). > > I don't immediately know what to think about this patch. Thanks for looking at it! > I've known about this issue for a long time, but I didn't think this was how > we > would fix it. I started initially with [1] but it was focusing on function-schema only. Then I proposed [2] making use of a dirty snapshot when recording the dependency. But this approach appeared to be "scary" and it was still failing to close some race conditions. Then, Tom proposed another approach in [2], which is that "creation DDL will have to take a lock on each referenced object that'd conflict with a lock taken by DROP". This is the one the current patch is trying to implement. > What you've done here instead is add locking at a much > lower level - whenever we are adding a dependency on an object, we > lock the object. The advantage of that approach is that we definitely > won't miss anything. Right, as there are many more dependencies than the ones related to schemas, for example:

- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper

to name a few. > The disadvantage of that approach is that it > means we have some very low-level code taking locks, which means it's > not obvious which operations are taking what locks. Right, but the new operations are "only" the ones leading to recording or altering a dependency. > Maybe it could > even result in some redundancy, like the higher-level code taking a > lock also (possibly in a different mode) and then this code taking > another one. The one that is added here is in AccessShareLock mode.
It could conflict with the ones taken in AccessExclusiveLock mode (if I'm not missing any):

- AcquireDeletionLock(): which is exactly what we want
- get_object_address()
- get_object_address_rv()
- ExecAlterObjectDependsStmt()
- ExecRenameStmt()
- ExecAlterObjectDependsStmt()
- ExecAlterOwnerStmt()
- RemoveObjects()
- AlterPublication()

I think there are two cases here: First case: the "conflicting" lock mode is for one of our own locks, then LockCheckConflicts() would report this as a NON conflict. Second case: the "conflicting" lock mode is NOT for one of our own locks, then LockCheckConflicts() would report a conflict. But I have the feeling that the existing code would already lock those sessions. One example where that would be the case: Session 1: doing "BEGIN; ALTER FUNCTION noschemas.foo2() SET SCHEMA alterschema" would acquire the lock in AccessExclusiveLock mode during ExecAlterObjectSchemaStmt()->get_object_address()->LockDatabaseObject() (in the existing code, and before the new call that would occur through changeDependencyFor()->depLockAndCheckObject() with the patch in place). Then, session 2: doing "alter function noschemas.foo2() owner to newrole;" would be locked in the existing code while doing ExecAlterOwnerStmt()->get_object_address()->LockDatabaseObject(). This means that in this example, the new lock this patch is introducing would not be responsible for session 2 being locked. Was your concern about "First case" or "Second case" or both? [1]: https://www.postgresql.org/message-id/flat/5a9daaae-5538-b209-6279-e903c3ea2157%40amazon.com [2]: https://www.postgresql.org/message-id/flat/8369ff70-0e31-f194-2954-787f4d9e21dd%40amazon.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
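A minimal sketch of those two cases (a toy Python model, not the server code; the real logic lives in LockCheckConflicts() and the lock conflict table, of which only a two-mode subset is shown here):

```python
# Toy two-mode subset of the lock conflict table: AccessExclusiveLock
# conflicts with everything, while AccessShareLock conflicts only with
# AccessExclusiveLock.
ACCESS_SHARE = "AccessShareLock"
ACCESS_EXCLUSIVE = "AccessExclusiveLock"

CONFLICT_TABLE = {
    (ACCESS_SHARE, ACCESS_EXCLUSIVE),
    (ACCESS_EXCLUSIVE, ACCESS_SHARE),
    (ACCESS_EXCLUSIVE, ACCESS_EXCLUSIVE),
}

def lock_check_conflicts(requested, held, held_by_self):
    """First case: a lock held by our own session never conflicts with
    our own request. Second case: another session's lock conflicts
    whenever the two modes conflict."""
    if held_by_self:
        return False
    return (requested, held) in CONFLICT_TABLE
```

In the session 1/session 2 example above: session 1's own AccessExclusiveLock does not block its later AccessShareLock request (first case), while session 2's request against session 1's AccessExclusiveLock conflicts (second case) — but session 2 would already have been blocked by the pre-existing get_object_address() locking.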
Re: Avoid orphaned objects dependencies, take 3
Hi, On Sun, May 19, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote: > Hello Bertrand, > > Probably, it's worth to avoid ERRCODE_INTERNAL_ERROR here in light of [1] > and [2], as these errors are not that abnormal (not Assert-like). > > [1] https://www.postgresql.org/message-id/Zic_GNgos5sMxKoa%40paquier.xyz > [2] https://commitfest.postgresql.org/48/4735/ > Thanks for mentioning the above examples, I agree that it's worth to avoid ERRCODE_INTERNAL_ERROR here: please find attached v7 that makes use of a new ERRCODE: ERRCODE_DEPENDENT_OBJECTS_DOES_NOT_EXIST I thought about this name as it is close enough to the already existing "ERRCODE_DEPENDENT_OBJECTS_STILL_EXIST" but I'm open to suggestion too. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 79809f88311ef1cf4e8f3250e1508fe0b7d86602 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v7] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. Also, after the new lock attempt, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 committs (that would not be the case should session 1 abort the transaction). 
If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with less details). The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - alter a dependency (function and schema) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- src/backend/catalog/dependency.c | 58 src/backend/catalog/objectaddress.c | 70 ++ src/backend/catalog/pg_depend.c | 12 ++ src/backend/utils/errcodes.txt| 1 + src/include/catalog/dependency.h | 1 + src/include/catalog/objectaddress.h | 1 + .../expected/test_dependencies_locks.out | 129 ++ src/test/isolation/isolation_schedule | 1 + .../specs/test_dependencies_locks.spec| 89 9 files changed, 362 insertions(+) 24.6% src/backend/catalog/ 45.1% src/test/isolation/expected/ 28.1% src/test/isolation/specs/ diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index d4b5b2ade1..89b2f18f98 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -1517,6 +1517,64 @@ AcquireDeletionLock(const ObjectAddress *object, int flags) } } +/* + * depLockAndCheckObject + * + * Lock the object that we are about to record a dependency on. + * After it's locked, verify that it hasn't been dropped while we + * weren't looking. If the object has been dropped, this function + * does not return! + */ +void +depLockAndCheckObject(const ObjectAddress *object) +{ + char *object_description; + + /* + * Those don't rely on LockDatabaseObject() when being dropped (see + * AcquireDeletionLock()). Also it looks like they can not produce + * orphaned dependent objects when being dropped. 
+ */ + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; + + object_description = getObjectDescription(object, true); + + /* assume we should lock the whole object not a sub-object */ + LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock); + + /* check if object still exists */ + if (!ObjectByIdExist(object, false)) + { + /* + * It might be possible that we are creating it (for example creating + * a composite type while creating a relation), so bypass the syscache + * lookup and use a dirty snaphot instead to cover this scenario. + */ + if (!ObjectByIdExist(object, true)) + { + /* + * If the object has been dropped before we get a chance to get + * its description, then emit a generic error message. That looks + * like a good compromise over extra complexity. + */ + if (object_description) +ereport(ERROR, + (errcode(ERRCODE_DEPE
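Condensed into Python (a paraphrase of the depLockAndCheckObject() flow above, with hypothetical helper names rather than the actual C API), the lock-then-recheck protocol is:

```python
class ObjectDropped(Exception):
    pass

def dep_lock_and_check_object(obj, lock_database_object, object_exists):
    # Take a lock that conflicts with the AccessExclusiveLock a concurrent
    # DROP would take, so the DROP either completed before us or now waits.
    lock_database_object(obj, "AccessShareLock")

    # Re-check existence after the lock is acquired: a plain (MVCC) lookup
    # first, then a dirty-snapshot lookup to cover objects our own
    # transaction is itself creating at the same time.
    if not object_exists(obj, dirty=False):
        if not object_exists(obj, dirty=True):
            raise ObjectDropped(f"object {obj} does not exist")
```

The ordering matters: checking existence before locking would leave a window for a concurrent DROP to commit between the check and the dependency record, which is exactly the orphaned-dependency race the patch closes.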
Re: PostgreSQL 17 Beta 1 release announcement draft
Hi, On Sun, May 19, 2024 at 05:10:10PM -0400, Jonathan S. Katz wrote: > On 5/16/24 1:15 AM, Bertrand Drouvot wrote: > > Hi, > > > > On Wed, May 15, 2024 at 09:45:35PM -0400, Jonathan S. Katz wrote: > > > Hi, > > > > > > Attached is a copy of the PostgreSQL 17 Beta 1 release announcement draft. > > > > Thanks for working on it! > > > > I've one comment: > > > > > PostgreSQL 17 also introduces a new view, > > > [`pg_wait_events`](https://www.postgresql.org/docs/17/view-pg-wait-events.html), > > > which provides descriptions about wait events and can be combined with > > > `pg_stat_activity` to give more insight into an operation. > > > > Instead of "to give more insight into an operation", what about "to give > > more > > insight about what a session is waiting for (should it be active)"? > > I put: > > "to give more in insight into why a session is blocked." Thanks! > > Does that work? > I think using "waiting" is better (as the view is "pg_wait_events" and the join with pg_stat_activity would be on the "wait_event_type" and "wait_event" columns). The reason I mentioned "should it be active" is that wait_event and wait_event_type can be non-empty in pg_stat_activity while the session is no longer in an active state (and thus no longer waiting). A correct query would be like the one in [1]:

"
SELECT a.pid, a.wait_event, w.description
FROM pg_stat_activity a
JOIN pg_wait_events w ON (a.wait_event_type = w.type AND a.wait_event = w.name)
WHERE a.wait_event is NOT NULL and a.state = 'active';
"

meaning it filters on the "active" state too, and that's what the description proposal I made was trying to highlight. [1]: https://www.postgresql.org/docs/devel/monitoring-stats.html Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PostgreSQL 17 Beta 1 release announcement draft
Hi, On Wed, May 15, 2024 at 09:45:35PM -0400, Jonathan S. Katz wrote: > Hi, > > Attached is a copy of the PostgreSQL 17 Beta 1 release announcement draft. Thanks for working on it! I've one comment: > PostgreSQL 17 also introduces a new view, > [`pg_wait_events`](https://www.postgresql.org/docs/17/view-pg-wait-events.html), > which provides descriptions about wait events and can be combined with > `pg_stat_activity` to give more insight into an operation. Instead of "to give more insight into an operation", what about "to give more insight about what a session is waiting for (should it be active)"? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Wed, May 15, 2024 at 08:31:43AM +, Bertrand Drouvot wrote: > Hi, > > On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote: > > +++ > > b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec > > > > > > This introduces a module with only one single spec. I could get > > behind an extra module if splitting the tests into more specs makes > > sense or if there is a restriction on installcheck. However, for > > one spec file filed with a bunch of objects, and note that I'm OK to > > let this single spec grow more for this range of tests, it seems to me > > that this would be better under src/test/isolation/. > > Yeah, I was not sure about this one (the location is from take 2 mentioned > up-thread). I'll look at moving the tests to src/test/isolation/. Please find attached v6 (only diff with v5 is moving the tests as suggested above). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 59f7befd34b8aa4ba0483a100e85bacbc1a76707 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v6] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. 
Also, after the new lock attempt, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 committs (that would not be the case should session 1 abort the transaction). If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with less details). The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - alter a dependency (function and schema) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- src/backend/catalog/dependency.c | 54 src/backend/catalog/objectaddress.c | 70 ++ src/backend/catalog/pg_depend.c | 12 ++ src/include/catalog/dependency.h | 1 + src/include/catalog/objectaddress.h | 1 + .../expected/test_dependencies_locks.out | 129 ++ src/test/isolation/isolation_schedule | 1 + .../specs/test_dependencies_locks.spec| 89 8 files changed, 357 insertions(+) 24.1% src/backend/catalog/ 45.9% src/test/isolation/expected/ 28.6% src/test/isolation/specs/ diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index d4b5b2ade1..a49357bbe2 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -1517,6 +1517,60 @@ AcquireDeletionLock(const ObjectAddress *object, int flags) } } +/* + * depLockAndCheckObject + * + * Lock the object that we are about to record a dependency on. + * After it's locked, verify that it hasn't been dropped while we + * weren't looking. If the object has been dropped, this function + * does not return! + */ +void +depLockAndCheckObject(const ObjectAddress *object) +{ + char *object_description; + + /* + * Those don't rely on LockDatabaseObject() when being dropped (see + * AcquireDeletionLock()). 
Also it looks like they can not produce + * orphaned dependent objects when being dropped. + */ + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; + + object_description = getObjectDescription(object, true); + + /* assume we should lock the whole object not a sub-object */ + LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock); + + /* check if object still exists */ + if (!ObjectByIdExist(object, false)) + { + /* + * It might be possible that we are creating it (for example creating + * a composite type while creating a relation), so bypass the syscache + * lookup and use a dirty snaphot instead to cover this scenario. + */ + if (!ObjectByIdExist(object, true)) + { + /* + * If the object has been dropped before we get a chance to get + * its description, then
Re: Avoid orphaned objects dependencies, take 3
Hi, On Tue, May 14, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote: > Hi Bertrand, > > 09.05.2024 15:20, Bertrand Drouvot wrote: > > Oh I see, your test updates an existing dependency. v4 took care about > > brand new > > dependencies creation (recordMultipleDependencies()) but forgot to take care > > about changing an existing dependency (which is done in another code path: > > changeDependencyFor()). > > > > Please find attached v5 that adds: > > > > - a call to the new depLockAndCheckObject() function in > > changeDependencyFor(). > > - a test when altering an existing dependency. > > > > With v5 applied, I don't see the issue anymore. > > Me too. Thank you for the improved version! > I will test the patch in the background, but for now I see no other > issues with it. Thanks for confirming! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote: > On Thu, May 09, 2024 at 12:20:51PM +0000, Bertrand Drouvot wrote: > > Oh I see, your test updates an existing dependency. v4 took care about > > brand new > > dependencies creation (recordMultipleDependencies()) but forgot to take care > > about changing an existing dependency (which is done in another code path: > > changeDependencyFor()). > > > > Please find attached v5 that adds: > > > > - a call to the new depLockAndCheckObject() function in > > changeDependencyFor(). > > - a test when altering an existing dependency. > > > > With v5 applied, I don't see the issue anymore. > > +if (object_description) > +ereport(ERROR, errmsg("%s does not exist", > object_description)); > +else > +ereport(ERROR, errmsg("a dependent object does not ex > > This generates an internal error code. Is that intended? Thanks for looking at it! Yes, it's like when, say, you want to create an object in a schema that does not exist (see get_namespace_oid()). > --- /dev/null > +++ > b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec > > This introduces a module with only one single spec. I could get > behind an extra module if splitting the tests into more specs makes > sense or if there is a restriction on installcheck. However, for > one spec file filed with a bunch of objects, and note that I'm OK to > let this single spec grow more for this range of tests, it seems to me > that this would be better under src/test/isolation/. Yeah, I was not sure about this one (the location is from take 2 mentioned up-thread). I'll look at moving the tests to src/test/isolation/. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Log details for stats dropped more than once
Hi, On Wed, May 15, 2024 at 02:47:29PM +0900, Michael Paquier wrote: > On Tue, May 14, 2024 at 10:07:14AM +0000, Bertrand Drouvot wrote: > > While resuming the work on refilenode stats (mentioned in [1] but did not > > share > > the patch yet), I realized that my current POC patch is buggy enough to > > produce > > things like: > > > > 024-05-14 09:51:14.783 UTC [1788714] FATAL: can only drop stats once > > > > While the CONTEXT provides the list of dropped stats: > > > > 2024-05-14 09:51:14.783 UTC [1788714] CONTEXT: WAL redo at 0/D75F478 for > > Transaction/ABORT: 2024-05-14 09:51:14.782223+00; dropped stats: > > 2/16384/27512/0 2/16384/27515/0 2/16384/27516/0 > > Can refcount be useful to know in this errcontext? Thanks for looking at it! Do you mean as part of the added errdetail_internal()? If so, yeah I think it's a good idea (done that way in v2 attached). > > Attached a tiny patch to report the stat that generates the error. The > > patch uses > > errdetail_internal() as the extra details don't seem to be useful to average > > users. > > I think that's fine. Overall that looks like useful information for > debugging, so no objections from here. Thanks! BTW, I just realized that adding more details for this error has already been mentioned in [1] (and Andres did propose a slightly different version). [1]: https://www.postgresql.org/message-id/20240505160915.6boysum4f34siqct%40awork3.anarazel.de Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 0866e8252f2038f3fdbd9cfabb214461689210df Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Tue, 14 May 2024 09:10:50 + Subject: [PATCH v2] Log details for stats dropped more than once Adding errdetail_internal() in pgstat_drop_entry_internal() to display the PgStat_HashKey and the entry's refcount when the "can only drop stats once" error is reported. 
--- src/backend/utils/activity/pgstat_shmem.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) 100.0% src/backend/utils/activity/ diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c index 91591da395..0595c08d6e 100644 --- a/src/backend/utils/activity/pgstat_shmem.c +++ b/src/backend/utils/activity/pgstat_shmem.c @@ -785,7 +785,12 @@ pgstat_drop_entry_internal(PgStatShared_HashEntry *shent, * backends to release their references. */ if (shent->dropped) - elog(ERROR, "can only drop stats once"); + ereport(ERROR, +errmsg("can only drop stats once"), +errdetail_internal("Stats kind=%s dboid=%u objoid=%u refcount=%u.", + pgstat_get_kind_info(shent->key.kind)->name, + shent->key.dboid, shent->key.objoid, + pg_atomic_read_u32(>refcount))); shent->dropped = true; /* release refcount marking entry as not dropped */ -- 2.34.1
Log details for stats dropped more than once
Hi hackers, While resuming the work on refilenode stats (mentioned in [1] but did not share the patch yet), I realized that my current POC patch is buggy enough to produce things like: 024-05-14 09:51:14.783 UTC [1788714] FATAL: can only drop stats once While the CONTEXT provides the list of dropped stats: 2024-05-14 09:51:14.783 UTC [1788714] CONTEXT: WAL redo at 0/D75F478 for Transaction/ABORT: 2024-05-14 09:51:14.782223+00; dropped stats: 2/16384/27512/0 2/16384/27515/0 2/16384/27516/0 It's not clear which one generates the error (don't pay attention to the actual values, the issue comes from the new refilenode stats that I removed from the output). Attached a tiny patch to report the stat that generates the error. The patch uses errdetail_internal() as the extra details don't seem to be useful to average users. [1]: https://www.postgresql.org/message-id/ZbIdgTjR2QcFJ2mE%40ip-10-97-1-34.eu-west-3.compute.internal Looking forward to your feedback, Regards -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From f296d4092f5e87ed63a323db3f1bc9cfa807e2e8 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Tue, 14 May 2024 09:10:50 + Subject: [PATCH v1] Log details for stats dropped more than once Adding errdetail_internal() in pgstat_drop_entry_internal() to display the PgStat_HashKey when the "can only drop stats once" error is reported. --- src/backend/utils/activity/pgstat_shmem.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) 100.0% src/backend/utils/activity/ diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c index 91591da395..1308c4439a 100644 --- a/src/backend/utils/activity/pgstat_shmem.c +++ b/src/backend/utils/activity/pgstat_shmem.c @@ -785,7 +785,11 @@ pgstat_drop_entry_internal(PgStatShared_HashEntry *shent, * backends to release their references. 
*/ if (shent->dropped) - elog(ERROR, "can only drop stats once"); + ereport(ERROR, +errmsg("can only drop stats once"), +errdetail_internal("Stats kind=%s dboid=%u objoid=%u.", + pgstat_get_kind_info(shent->key.kind)->name, + shent->key.dboid, shent->key.objoid)); shent->dropped = true; /* release refcount marking entry as not dropped */ -- 2.34.1
Re: Avoid orphaned objects dependencies, take 3
Hi, On Tue, Apr 30, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote: > Hi Bertrand, > > But I've discovered yet another possibility to get a broken dependency. Thanks for the testing and the report! > Please try this script:
> res=0
> numclients=20
> for ((i=1;i<=100;i++)); do
> for ((c=1;c<=numclients;c++)); do
> echo "
> CREATE SCHEMA s_$c;
> CREATE CONVERSION myconv_$c FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
> ALTER CONVERSION myconv_$c SET SCHEMA s_$c;
> " | psql >psql1-$c.log 2>&1 &
> echo "DROP SCHEMA s_$c RESTRICT;" | psql >psql2-$c.log 2>&1 &
> done
> wait
> pg_dump -f db.dump || { echo "on iteration $i"; res=1; break; }
> for ((c=1;c<=numclients;c++)); do
> echo "DROP SCHEMA s_$c CASCADE;" | psql >psql3-$c.log 2>&1
> done
> done
> psql -c "SELECT * FROM pg_conversion WHERE connamespace NOT IN (SELECT oid FROM pg_namespace);"
>
> It fails for me (with the v4 patch applied) as follows:
> pg_dump: error: schema with OID 16392 does not exist
> on iteration 1
> oid | conname | connamespace | conowner | conforencoding | contoencoding | conproc | condefault
> ---+--+--+--++---+---+
> 16396 | myconv_6 | 16392 | 10 | 8 | 6 | iso8859_1_to_utf8 | f

Thanks for sharing the test, I'm able to reproduce the issue with v4. Oh I see, your test updates an existing dependency. v4 took care of brand-new dependency creation (recordMultipleDependencies()) but forgot to handle changing an existing dependency (which is done in another code path: changeDependencyFor()). Please find attached v5 that adds:

- a call to the new depLockAndCheckObject() function in changeDependencyFor().
- a test when altering an existing dependency.

With v5 applied, I don't see the issue anymore.
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 6fc24a798210f730ab04833fa58074f142be968e Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v5] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. Also, after the new lock attempt, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 committs (that would not be the case should session 1 abort the transaction). If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with less details). 
The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - alter a dependency (function and schema) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- src/backend/catalog/dependency.c | 54 src/backend/catalog/objectaddress.c | 70 ++ src/backend/catalog/pg_depend.c | 12 ++ src/include/catalog/dependency.h | 1 + src/include/catalog/objectaddress.h | 1 + src/test/modules/Makefile | 1 + src/test/modules/meson.build | 1 + .../test_dependencies_locks/.gitignore| 3 + .../modules/test_dependencies_locks/Makefile | 14 ++ .../expected/test_dependencies_locks.out | 129 ++ .../test_dependencies_locks/meson.build | 12 ++ .../specs/test_dependencies_locks.spec| 89 12 files changed, 387 insertions(+) 22.9% src/backend/catalog/ 43.7% src/test/modules/test_dependencies_locks/expected/ 27.2% src/test/modules/test_dependencies_locks/specs/ 4.5% src/test/modules/test_dependencies_locks/ diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index d4b5b2ade1..a49357bbe2 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -15
Re: First draft of PG 17 release notes
Hi, On Thu, May 09, 2024 at 12:03:50AM -0400, Bruce Momjian wrote: > I have committed the first draft of the PG 17 release notes; you can > see the results here: > > https://momjian.us/pgsql_docs/release-17.html Thanks for working on that! > I welcome feedback. > Add system view pg_wait_events that reports wait event types (Michael > Paquier) Michael is the committer for 1e68e43d3f, the author is me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi, On Mon, Apr 29, 2024 at 11:58:09AM +, Zhijie Hou (Fujitsu) wrote: > On Monday, April 29, 2024 5:11 PM shveta malik wrote: > > > > On Mon, Apr 29, 2024 at 11:38 AM shveta malik > > wrote: > > > > > > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot > > wrote: > > > > > > > > > > Hi, > > > > > > > > > > On Thu, Mar 14, 2024 at 02:22:44AM +, Zhijie Hou (Fujitsu) wrote: > > > > > > Hi, > > > > > > > > > > > > Since the standby_slot_names patch has been committed, I am > > > > > > attaching the last doc patch for review. > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > 1 === > > > > > > > > > > + continue subscribing to publications now on the new primary > > > > > + server > > > > > without > > > > > + any data loss. > > > > > > > > > > I think "without any data loss" should be re-worded in this > > > > > context. Data loss in the sense "data committed on the primary and > > > > > not visible on the subscriber in case of failover" can still occurs > > > > > (in case > > synchronous replication is not used). > > > > > > > > > > 2 === > > > > > > > > > > + If the result (failover_ready) of both above > > > > > steps is > > > > > + true, existing subscriptions will be able to continue without data > > loss. > > > > > + > > > > > > > > > > I don't think that's true if synchronous replication is not used. > > > > > Say, > > > > > > > > > > - synchronous replication is not used > > > > > - primary is not able to reach the standby anymore and > > > > > standby_slot_names is set > > > > > - new data is inserted into the primary > > > > > - then not replicated to subscriber (due to standby_slot_names) > > > > > > > > > > Then I think the both above steps will return true but data would > > > > > be lost in case of failover. > > > > > > > > Thanks for the comments, attach the new version patch which reworded > > > > the above places. Thanks! > Here is the V3 doc patch. Thanks! 
A few comments:

1 ===

+ losing any data that has been flushed to the new primary server.

It might be worth adding a few words about possible data loss, something like: "Please note that if synchronous replication is not used and standby_slot_names is set correctly, it is still possible to lose data that was committed on the old primary server (for example, if the standby was not reachable during that time)."

2 ===

+test_sub=# SELECT
+ array_agg(slotname) AS slots
+ FROM
+ ((
+ SELECT r.srsubid AS subid, CONCAT('pg_', srsubid, '_sync_', srrelid, '_', ctl.system_identifier) AS slotname
+ FROM pg_control_system() ctl, pg_subscription_rel r, pg_subscription s
+ WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover
+ ) UNION (

I guess this format comes from ReplicationSlotNameForTablesync(). What about creating a SQL-callable function on top of it and making use of it in the query above? (That would ensure the doc stays up to date even if the format changes in ReplicationSlotNameForTablesync().)

3 ===

+test_sub=# SELECT
+ MAX(remote_lsn) AS remote_lsn_on_subscriber
+ FROM
+ ((
+ SELECT (CASE WHEN r.srsubstate = 'f' THEN pg_replication_origin_progress(CONCAT('pg_', r.srsubid, '_', r.srrelid), false)
+ WHEN r.srsubstate IN ('s', 'r') THEN r.srsublsn END) AS remote_lsn
+ FROM pg_subscription_rel r, pg_subscription s
+ WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid = r.srsubid AND s.subfailover
+ ) UNION (
+ SELECT pg_replication_origin_progress(CONCAT('pg_', s.oid), false) AS remote_lsn
+ FROM pg_subscription s
+ WHERE s.subfailover
+ ));

What about adding a join to pg_replication_origin to get rid of the "hardcoded" formats "CONCAT('pg_', r.srsubid, '_', r.srrelid)" and "CONCAT('pg_', s.oid)"? The idea behind 2 === and 3 === is to have the queries as generic as possible and not rely on a hardcoded format (that would be more difficult to maintain should those formats change in the future).
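For reference, the sync slot name format that the query in 2 === hardcodes (and that a SQL-callable wrapper around ReplicationSlotNameForTablesync() would centralize) boils down to the following — sketched here in Python purely as an illustration:

```python
def tablesync_slot_name(subid: int, relid: int, system_identifier: int) -> str:
    # Mirrors the CONCAT('pg_', srsubid, '_sync_', srrelid, '_',
    # ctl.system_identifier) expression used in the query above.
    return f"pg_{subid}_sync_{relid}_{system_identifier}"
```

Exposing this as a SQL function would let the documented query call it instead of duplicating the format, so a later change in ReplicationSlotNameForTablesync() could not silently break the documentation.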
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
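To make point 2 === above more concrete, the suggestion is something along these lines. Note that pg_tablesync_slot_name is a purely hypothetical name: no such SQL-callable wrapper around ReplicationSlotNameForTablesync() exists today, it would have to be written first.

```sql
-- hypothetical: a SQL-callable wrapper around ReplicationSlotNameForTablesync(),
-- so that the documented query no longer hardcodes the slot name format:
SELECT array_agg(pg_tablesync_slot_name(r.srsubid, r.srrelid)) AS slots
FROM pg_subscription_rel r
JOIN pg_subscription s ON s.oid = r.srsubid
WHERE r.srsubstate = 'f' AND s.subfailover;
```

With something like this in place, a change to the slot name format would only need to be made in one place and the documented query would keep working.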
Re: Avoid orphaned objects dependencies, take 3
Hi, On Thu, Apr 25, 2024 at 09:00:00AM +0300, Alexander Lakhin wrote: > Hi, > > 25.04.2024 08:00, Bertrand Drouvot wrote: > > > > > though concurrent create/drop operations can still produce > > > the "cache lookup failed" error, which is probably okay, except that it is > > > an INTERNAL_ERROR, which assumed to be not easily triggered by users. > > I did not see any of those "cache lookup failed" during my testing > > with/without > > your script. During which test(s) did you see those with v3 applied? > > You can try, for example, table-trigger, or other tests that check for > "cache lookup failed" psql output only (maybe you'll need to increase the > iteration count). For instance, I've got (with v4 applied): > 2024-04-25 05:48:08.102 UTC [3638763] ERROR: cache lookup failed for > function 20893 > 2024-04-25 05:48:08.102 UTC [3638763] STATEMENT: CREATE TRIGGER modified_c1 > BEFORE UPDATE OF c ON t > FOR EACH ROW WHEN (OLD.c <> NEW.c) EXECUTE PROCEDURE > trigger_func('modified_c'); > > Or with function-function: > 2024-04-25 05:52:31.390 UTC [3711897] ERROR: cache lookup failed for > function 32190 at character 54 > 2024-04-25 05:52:31.390 UTC [3711897] STATEMENT: CREATE FUNCTION f1() > RETURNS int LANGUAGE SQL RETURN f() + 1; > -- > 2024-04-25 05:52:37.639 UTC [3720011] ERROR: cache lookup failed for > function 34465 at character 54 > 2024-04-25 05:52:37.639 UTC [3720011] STATEMENT: CREATE FUNCTION f1() > RETURNS int LANGUAGE SQL RETURN f() + 1; I see, so this is during object creation. It's easy to reproduce this kind of issue with gdb. 
For example set a breakpoint on SearchSysCache1() and during the create function f1() once it breaks on: #0 SearchSysCache1 (cacheId=45, key1=16400) at syscache.c:221 #1 0x5ad305beacd6 in func_get_detail (funcname=0x5ad308204d50, fargs=0x0, fargnames=0x0, nargs=0, argtypes=0x72ff9cc0, expand_variadic=true, expand_defaults=true, include_out_arguments=false, funcid=0x72ff9ba0, rettype=0x72ff9b9c, retset=0x72ff9b94, nvargs=0x72ff9ba4, vatype=0x72ff9ba8, true_typeids=0x72ff9bd8, argdefaults=0x72ff9be0) at parse_func.c:1622 #2 0x5ad305be7dd0 in ParseFuncOrColumn (pstate=0x5ad30823be98, funcname=0x5ad308204d50, fargs=0x0, last_srf=0x0, fn=0x5ad308204da0, proc_call=false, location=55) at parse_func.c:266 #3 0x5ad305bdffb0 in transformFuncCall (pstate=0x5ad30823be98, fn=0x5ad308204da0) at parse_expr.c:1474 #4 0x5ad305bdd2ee in transformExprRecurse (pstate=0x5ad30823be98, expr=0x5ad308204da0) at parse_expr.c:230 #5 0x5ad305bdec34 in transformAExprOp (pstate=0x5ad30823be98, a=0x5ad308204e20) at parse_expr.c:990 #6 0x5ad305bdd1a0 in transformExprRecurse (pstate=0x5ad30823be98, expr=0x5ad308204e20) at parse_expr.c:187 #7 0x5ad305bdd00b in transformExpr (pstate=0x5ad30823be98, expr=0x5ad308204e20, exprKind=EXPR_KIND_SELECT_TARGET) at parse_expr.c:131 #8 0x5ad305b96b7e in transformReturnStmt (pstate=0x5ad30823be98, stmt=0x5ad308204ee0) at analyze.c:2395 #9 0x5ad305b92213 in transformStmt (pstate=0x5ad30823be98, parseTree=0x5ad308204ee0) at analyze.c:375 #10 0x5ad305c6321a in interpret_AS_clause (languageOid=14, languageName=0x5ad308204c40 "sql", funcname=0x5ad308204ad8 "f100", as=0x0, sql_body_in=0x5ad308204ee0, parameterTypes=0x0, inParameterNames=0x0, prosrc_str_p=0x72ffa208, probin_str_p=0x72ffa200, sql_body_out=0x72ffa210, queryString=0x5ad3082040b0 "CREATE FUNCTION f100() RETURNS int LANGUAGE SQL RETURN f() + 1;") at functioncmds.c:953 #11 0x5ad305c63c93 in CreateFunction (pstate=0x5ad308186310, stmt=0x5ad308204f00) at functioncmds.c:1221 then drop function f() in another 
session. Then the CREATE FUNCTION would fail with: postgres=# CREATE FUNCTION f() RETURNS int LANGUAGE SQL RETURN f() + 1; ERROR: cache lookup failed for function 16400 This happens before we get a chance to call the new depLockAndCheckObject() function. I think this is what Tom was referring to in [1]: " So the only real fix for this would be to make every object lookup in the entire system do the sort of dance that's done in RangeVarGetRelidExtended. " The fact that those kinds of errors appear also somehow ensures that no orphaned dependencies can be created. The patch does ensure that no orphaned dependencies can occur after those "initial" lookups are done (should the dependent object be dropped). I'm tempted not to add extra complexity to avoid those kinds of errors and to keep the patch as it is. All of this serves the same goal: no orphaned dependencies are created. [1]: https://www.postgresql.org/message-id/2872252.1630851337%40sss.pgh.pa.us Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
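For readers who want to see the race described above without setting up gdb timing by hand, the two-session sequence boils down to the following transcript (object names are illustrative; in practice the window between the syscache lookup and the dependency recording is small, which is why a breakpoint on SearchSysCache1() is used to widen it):

```sql
-- session 1: create the referenced function, then pause (e.g. via the
-- gdb breakpoint on SearchSysCache1() described above) while creating
-- a function that references it:
CREATE FUNCTION f() RETURNS int LANGUAGE SQL AS 'SELECT 1';
CREATE FUNCTION f1() RETURNS int LANGUAGE SQL RETURN f() + 1;  -- paused mid-lookup

-- session 2: while session 1 is paused inside the function lookup:
DROP FUNCTION f();

-- session 1 then resumes and fails with something like:
-- ERROR:  cache lookup failed for function 16400
```

The error is raised during the initial name lookup, i.e. before depLockAndCheckObject() gets involved, which is why the patch does not (and arguably need not) cover this path.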
Re: Avoid orphaned objects dependencies, take 3
Hi, On Wed, Apr 24, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote: > 24.04.2024 11:38, Bertrand Drouvot wrote: > > I gave more thought to v2 and the approach seems reasonable to me. > > Basically what > > it does is that in case the object is already dropped before we take the > > new lock > > (while recording the dependency) then the error message is a "generic" one > > (means > > it does not provide the description of the "already" dropped object). I > > think it > > makes sense to write the patch that way by abandoning the patch's ambition > > to > > tell the description of the dropped object in all the cases. > > > > Of course, I would be happy to hear others thought about it. > > > > Please find v3 attached (which is v2 with more comments). > > Thank you for the improved version! > > I can confirm that it fixes that case. Great, thanks for the test! > I've also tested other cases that fail on master (most of them also fail > with v1), please try/look at the attached script. Thanks for all those tests! > (There could be other broken-dependency cases, of course, but I think I've > covered all the representative ones.) Agree. Furthermore, the way the patch is written should be agnostic to the kinds of objects that are part of the dependency. Having said that, it does not hurt to add more tests in this patch, so the attached v4 adds some of your tests (that would fail on the master branch without the patch applied). The way the tests are written in the patch is less "racy" than when they are triggered with your script. Indeed, I think that in the patch the dependent object cannot be removed before the new lock is taken when recording the dependency. We may want to bring injection points into the game if we feel the need. > All tested cases work correctly with v3 applied — Yeah, same on my side, I did run them too and did not find any issues. > I couldn't get broken > dependencies, Same here.
> though concurrent create/drop operations can still produce > the "cache lookup failed" error, which is probably okay, except that it is > an INTERNAL_ERROR, which assumed to be not easily triggered by users. I did not see any of those "cache lookup failed" during my testing with/without your script. During which test(s) did you see those with v3 applied? Attached v4, simply adding more tests to v3. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From a9b34955fab0351b7b5a7ba6eb36f199f5a5822c Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v4] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non-existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non-existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. Also, after the new lock attempt, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 commits (that would not be the case should session 1 abort the transaction). If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with fewer details).
The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- src/backend/catalog/dependency.c | 54 + src/backend/catalog/objectaddress.c | 70 +++ src/backend/catalog/pg_depend.c | 6 + src/include/catalog/dependency.h | 1 + src/include/catalog/objectaddress.h | 1 + src/test/modules/Makefile | 1 + src/test/modules/meson.build | 1 + .../test_dependencies_locks/.gitignore| 3 + .../modules/test_dependencies_locks/Makefile | 14 +++ .../expected/test_dependencies_locks.out | 113 ++ .../test_dependencies_locks/meson.build | 12 ++ .../specs/test_dependencies_locks.spec| 78 12 files changed
Re: Avoid orphaned objects dependencies, take 3
Hi, On Tue, Apr 23, 2024 at 04:20:46PM +, Bertrand Drouvot wrote: > Please find attached v2 that should not produce the issue anymore (I launched > a > lot of attempts without any issues). v1 was not strong enough as it was not > always checking for the dependent object existence. v2 now always checks if > the > object still exists after the additional lock acquisition attempt while > recording > the dependency. > > I still need to think about v2 but in the meantime could you please also give > v2 a try on you side? I gave more thought to v2 and the approach seems reasonable to me. Basically what it does is that in case the object is already dropped before we take the new lock (while recording the dependency) then the error message is a "generic" one (means it does not provide the description of the "already" dropped object). I think it makes sense to write the patch that way by abandoning the patch's ambition to tell the description of the dropped object in all the cases. Of course, I would be happy to hear others thought about it. Please find v3 attached (which is v2 with more comments). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 32945912ceddad6b171fd8b345adefa4432af357 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v3] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema. 
To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. Also, after locking the object, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 committs (that would not be the case should session 1 abort the transaction). The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - function and type - table and type --- src/backend/catalog/dependency.c | 54 ++ src/backend/catalog/objectaddress.c | 70 +++ src/backend/catalog/pg_depend.c | 6 ++ src/include/catalog/dependency.h | 1 + src/include/catalog/objectaddress.h | 1 + src/test/modules/Makefile | 1 + src/test/modules/meson.build | 1 + .../test_dependencies_locks/.gitignore| 3 + .../modules/test_dependencies_locks/Makefile | 14 .../expected/test_dependencies_locks.out | 49 + .../test_dependencies_locks/meson.build | 12 .../specs/test_dependencies_locks.spec| 39 +++ 12 files changed, 251 insertions(+) 40.5% src/backend/catalog/ 29.6% src/test/modules/test_dependencies_locks/expected/ 18.7% src/test/modules/test_dependencies_locks/specs/ 8.3% src/test/modules/test_dependencies_locks/ diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index d4b5b2ade1..a49357bbe2 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -1517,6 +1517,60 @@ AcquireDeletionLock(const ObjectAddress *object, int flags) } } +/* + * depLockAndCheckObject + * + * Lock the object that we are about to record a dependency on. + * After it's locked, verify that it hasn't been dropped while we + * weren't looking. If the object has been dropped, this function + * does not return! 
+ */ +void +depLockAndCheckObject(const ObjectAddress *object) +{ + char *object_description; + + /* + * Those don't rely on LockDatabaseObject() when being dropped (see + * AcquireDeletionLock()). Also it looks like they can not produce + * orphaned dependent objects when being dropped. + */ + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; + + object_description = getObjectDescription(object, true); + + /* assume we should lock the whole object not a sub-object */ + LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock); + + /* check if object still exists */ + if (!ObjectByIdExist(object, false)) + { + /* + * It might be possible that we are creating it (for example creating + * a composite type while creating a relation), so bypass the syscache
Re: Avoid orphaned objects dependencies, take 3
Hi, On Tue, Apr 23, 2024 at 04:59:09AM +, Bertrand Drouvot wrote: > Hi, > > On Mon, Apr 22, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote: > > 22.04.2024 13:52, Bertrand Drouvot wrote: > > > > > > That's weird, I just launched it several times with the patch applied and > > > I'm not > > > able to see the seg fault (while I can see it constently failing on the > > > master > > > branch). > > > > > > Are you 100% sure you tested it against a binary with the patch applied? > > > > > > > Yes, at least I can't see what I'm doing wrong. Please try my > > self-contained script attached. > > Thanks for sharing your script! > > Yeah your script ensures the patch is applied before the repro is executed. > > I do confirm that I can also see the issue with the patch applied (but I had > to > launch multiple attempts, while on master one attempt is enough). > > I'll have a look. Please find attached v2 that should not produce the issue anymore (I launched a lot of attempts without any issues). v1 was not strong enough as it was not always checking for the dependent object existence. v2 now always checks if the object still exists after the additional lock acquisition attempt while recording the dependency. I still need to think about v2 but in the meantime could you please also give v2 a try on you side? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From f80fdfc32e791463555aa318f26ff5e7363ac3ac Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v2] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. 
Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. Also, after locking the object, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 committs (that would not be the case should session 1 abort the transaction). The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - function and type - table and type --- src/backend/catalog/dependency.c | 48 + src/backend/catalog/objectaddress.c | 70 +++ src/backend/catalog/pg_depend.c | 6 ++ src/include/catalog/dependency.h | 1 + src/include/catalog/objectaddress.h | 1 + src/test/modules/Makefile | 1 + src/test/modules/meson.build | 1 + .../test_dependencies_locks/.gitignore| 3 + .../modules/test_dependencies_locks/Makefile | 14 .../expected/test_dependencies_locks.out | 49 + .../test_dependencies_locks/meson.build | 12 .../specs/test_dependencies_locks.spec| 39 +++ 12 files changed, 245 insertions(+) 38.1% src/backend/catalog/ 30.7% src/test/modules/test_dependencies_locks/expected/ 19.5% src/test/modules/test_dependencies_locks/specs/ 8.7% src/test/modules/test_dependencies_locks/ diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index d4b5b2ade1..e3b66025dd 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -1517,6 +1517,54 @@ AcquireDeletionLock(const ObjectAddress *object, int flags) } } +/* + * depLockAndCheckObject + * + * Lock the object that we are about to record a dependency on. 
+ * After it's locked, verify that it hasn't been dropped while we + * weren't looking. If the object has been dropped, this function + * does not return! + */ +void +depLockAndCheckObject(const ObjectAddress *object) +{ + char *object_description; + + /* + * Those don't rely on LockDatabaseObject() when being dropped (see + * AcquireDeletionLock()). Also it looks like they can not produce + * orphaned dependent objects when being dropped. + */ + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; + + object_description = getObjectDescription(object, true); + + /* assume we should lock the whole object not a sub-object */
Re: Avoid orphaned objects dependencies, take 3
Hi, On Mon, Apr 22, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote: > 22.04.2024 13:52, Bertrand Drouvot wrote: > > > > That's weird, I just launched it several times with the patch applied and > > I'm not > > able to see the seg fault (while I can see it consistently failing on the > > master > > branch). > > > > Are you 100% sure you tested it against a binary with the patch applied? > > > > Yes, at least I can't see what I'm doing wrong. Please try my > self-contained script attached. Thanks for sharing your script! Yeah, your script ensures the patch is applied before the repro is executed. I do confirm that I can also see the issue with the patch applied (but I had to launch multiple attempts, while on master one attempt is enough). I'll have a look. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Mon, Apr 22, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote: > Hi Bertrand, > > 22.04.2024 11:45, Bertrand Drouvot wrote: > > Hi, > > > > This new thread is a follow-up of [1] and [2]. > > > > Problem description: > > > > We have occasionally observed objects having an orphaned dependency, the > > most common case we have seen is functions not linked to any namespaces. > > > > ... > > > > Looking forward to your feedback, > > This have reminded me of bug #17182 [1]. Thanks for the link to the bug! > Unfortunately, with the patch applied, the following script: > > for ((i=1;i<=100;i++)); do > ( { for ((n=1;n<=20;n++)); do echo "DROP SCHEMA s;"; done } | psql ) > >psql1.log 2>&1 & > echo " > CREATE SCHEMA s; > CREATE FUNCTION s.func1() RETURNS int LANGUAGE SQL AS 'SELECT 1;'; > CREATE FUNCTION s.func2() RETURNS int LANGUAGE SQL AS 'SELECT 2;'; > CREATE FUNCTION s.func3() RETURNS int LANGUAGE SQL AS 'SELECT 3;'; > CREATE FUNCTION s.func4() RETURNS int LANGUAGE SQL AS 'SELECT 4;'; > CREATE FUNCTION s.func5() RETURNS int LANGUAGE SQL AS 'SELECT 5;'; > " | psql >psql2.log 2>&1 & > wait > psql -c "DROP SCHEMA s CASCADE" >psql3.log > done > echo " > SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid FROM > pg_proc pp > LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS NULL" > | psql > > still ends with: > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > 2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|LOG: server process (PID > 1388378) was terminated by signal 11: Segmentation fault > 2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|DETAIL: Failed process was > running: SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid > FROM pg_proc pp > LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS > NULL > Thanks for sharing the script. 
That's weird, I just launched it several times with the patch applied and I'm not able to see the seg fault (while I can see it consistently failing on the master branch). Are you 100% sure you tested it against a binary with the patch applied? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Disallow changing slot's failover option in transaction block
Hi, On Mon, Apr 22, 2024 at 11:32:14AM +0530, shveta malik wrote: > On Mon, Apr 22, 2024 at 5:57 AM Zhijie Hou (Fujitsu) > wrote: > > Attach the V3 patch which also addressed Shveta[1] and Bertrand[2]'s > > comments. Thanks! > Tested the patch, works well. Same here. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Patch to avoid orphaned dependencies
Hi, On Wed, Mar 23, 2022 at 12:49:06PM -0400, Tom Lane wrote: > Realistically, if we want to prevent this type of problem, then all > creation DDL will have to take a lock on each referenced object that'd > conflict with a lock taken by DROP. This might not be out of reach: > I think we do already take such locks while dropping objects. The > reference-side lock could be taken by the recordDependency mechanism > itself, ensuring that we don't miss anything; and that would also > allow us to not bother taking such a lock on pinned objects, which'd > greatly cut the cost (though not to zero). Thanks for the idea (and sorry for the delay replying to it)! I had a look at it and just created a new thread [1] based on your proposal. [1]: https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Avoid orphaned objects dependencies, take 3
Hi, This new thread is a follow-up of [1] and [2]. Problem description: We have occasionally observed objects having an orphaned dependency; the most common case we have seen is functions not linked to any namespace. Examples to produce such orphaned dependencies: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non-existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non-existing schema. A patch was initially proposed to fix this particular (function-to-namespace) dependency (see [1]), but there could be many more scenarios (like the function-to-datatype one highlighted by Gilles in [1] that could lead to a function having an invalid parameter datatype). As Tom said, there are dozens more cases that would need to be considered, so a global approach to avoid those race conditions should be considered instead. A first global approach was attempted in [2], making use of a dirty snapshot when recording the dependency. But this approach appeared to be "scary" and it was still failing to close some race conditions (see [2] for details). Then, Tom proposed another approach in [2], which is that "creation DDL will have to take a lock on each referenced object that'd conflict with a lock taken by DROP". This is what the attached patch is trying to achieve. It does the following: 1) A new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. Thanks to it, the drop schema in scenario 2 would be locked (resulting in an error should session 1 commit). 2) After locking the object while recording the dependency, the patch checks that the object still exists.
Thanks to it, session 2 in scenario 1 would be locked and would report an error once session 1 commits (that would not be the case should session 1 abort the transaction). The patch also adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - function and type - table and type (which is, I think, problematic enough, since it brings a table into the game, to justify fixing this as a whole). [1]: https://www.postgresql.org/message-id/flat/a4f55089-7cbd-fe5d-a9bb-19adc6418...@darold.net#9af5cdaa9e80879beb1def3604c976e8 [2]: https://www.postgresql.org/message-id/flat/8369ff70-0e31-f194-2954-787f4d9e21dd%40amazon.com Please note that I'm not familiar with this area of the code, so the patch might not implement the option proposed by Tom the "right" way. Adding the patch to the July CF. Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 6798e85b0679dfccfa665007b06d9f1a0e39e0c6 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v1] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non-existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non-existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place when the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked.
Also, after locking the object, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 committs (that would not be the case should session 1 abort the transaction). The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - function and type - table and type --- src/backend/catalog/dependency.c | 42 ++ src/backend/catalog/objectaddress.c | 57 +++ src/backend/catalog/pg_depend.c | 6 ++ src/include/catalog/dependency.h | 1 + src/include/catalog/objectaddress.h | 1 + src/test/modules/Makefile | 1 + src/test/modules/meson.build | 1 + .../test_dependencies_locks/.gitignore| 3 + .../modules/test_dependencies_locks/Makefile | 14 + .../expected/test_dependencies_locks.out | 49 .../te
Re: Disallow changing slot's failover option in transaction block
Hi, On Fri, Apr 19, 2024 at 12:39:40AM +, Zhijie Hou (Fujitsu) wrote: > Here is V2 patch which includes the changes for CREATE SUBSCRIPTION as > suggested. Since we don't connect pub to alter slot when (create_slot=false) > anymore, the restriction that disallows failover=true when connect=false is > also removed. Thanks! + specified in the subscription. When creating the slot, ensure the slot + property failover matches the counterpart parameter + of the subscription. The slot could be created before or after the subscription is created, so I think it needs a bit of rewording (as here it sounds like the subscription is already created), something like: "Always ensure the slot property failover matches the counterpart parameter of the subscription and vice versa." Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
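As an illustration of the "slot created first, subscription afterwards" ordering discussed above, the two sides could look like the following sketch. This assumes the PostgreSQL 17 signature of pg_create_logical_replication_slot(), where failover is the fifth argument; the slot, subscription, publication names and the connection string are placeholders.

```sql
-- on the publisher: create the slot with failover enabled, matching the
-- failover option of the subscription that will be created below
SELECT pg_create_logical_replication_slot('mysub_slot', 'pgoutput',
                                          false,  -- temporary
                                          false,  -- twophase
                                          true);  -- failover

-- on the subscriber: create the subscription pointing at the existing slot
CREATE SUBSCRIPTION mysub
  CONNECTION 'host=primary dbname=postgres'
  PUBLICATION mypub
  WITH (create_slot = false, slot_name = 'mysub_slot', failover = true);
```

If the failover properties do not match on both sides, the slot will not be kept in sync for standby failover as the user expects, hence the suggested "and vice versa" wording.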
Re: promotion related handling in pg_sync_replication_slots()
Hi, On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote: > Please find v8 attached. Changes are: Thanks! A few comments: 1 === @@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len) * slotsync_worker_onexit() but that will need the connection to be made * global and we want to avoid introducing global for this purpose. */ - before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); + before_shmem_exit(slotsync_worker_disconnect, PointerGetDatum(wrconn)); The comment above this change still states "Register the failure callback once we have the connection"; I think it has to be reworded a bit now that v8 makes use of slotsync_worker_disconnect(). 2 === +* Register slotsync_worker_onexit() before we register +* ReplicationSlotShmemExit() in BaseInit(), to ensure that during exit of +* slot sync worker, ReplicationSlotShmemExit() is called first, followed +* by slotsync_worker_onexit(). Startup process during promotion waits for Worth mentioning in shmem_exit() (where it does "while (--before_shmem_exit_index >= 0)"), or before the shmem_exit() definition, that ReplSlotSyncWorkerMain() relies on this LIFO behavior? (Not sure if there are other "strong" LIFO requirements in other parts of the code.) 3 === +* Startup process during promotion waits for slot sync to finish and it +* does that by checking the 'syncing' flag. Worth mentioning ShutDownSlotSync()? 4 === I did a few tests manually (launching ShutDownSlotSync() through gdb / with and without sync worker and with / without pg_sync_replication_slots() running concurrently) and it looks like it works as designed. Having said that, the logic that is in place to take care of the corner cases described up-thread seems reasonable to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: promotion related handling in pg_sync_replication_slots()
Hi, On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote: > On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu) > wrote: > > I personally feel adding the additional check for sync_replication_slots may > > not improve the situation here. Because the GUC sync_replication_slots can > > change at any point, the GUC could be false when performing this addition > > check > > and is set to true immediately after the check, so It could not simplify > > the logic > > anyway. > > +1. > I feel doc and "cannot synchronize replication slots concurrently" > check should suffice. > > In the scenario which Hou-San pointed out, if after performing the > GUC check in SQL function, this GUC is enabled immediately and say > worker is started sooner than the function could get chance to sync, > in that case as well, SQL function will ultimately get error "cannot > synchronize replication slots concurrently", even though GUC is > enabled. Thus, I feel we should stick with samer error in all > scenarios. Okay, fine by me, let's forget about checking sync_replication_slots then. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Disallow changing slot's failover option in transaction block
Hi, On Tue, Apr 16, 2024 at 06:32:11AM +, Zhijie Hou (Fujitsu) wrote: > Hi, > > Kuroda-San reported an issue off-list that: > > If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block > and rollback, only the subscription option change can be rolled back, while > the > replication slot's failover change is preserved. Nice catch, thanks! > To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set > (failover) inside a txn block, which is also consistent to the ALTER > SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small > patch to address this. Agree. The patch looks pretty straightforward to me. Worth adding this case to the doc? (We already mention there that "Commands ALTER SUBSCRIPTION ... REFRESH PUBLICATION and ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ... with refresh option as true cannot be executed inside a transaction block".) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: promotion related handling in pg_sync_replication_slots()
Hi, On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote: > Please find v5 addressing above comments. Thanks! @@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn) { PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn)); { + check_flags_and_set_sync_info(InvalidPid); + Given the fact that if the sync worker is running it won't be possible to trigger a manual sync with pg_sync_replication_slots(), what about also checking the "sync_replication_slots" value at the start of SyncReplicationSlots() and emitting an error if sync_replication_slots is set to on? (The message could explicitly state that it's not possible to use the function if sync_replication_slots is set to on.) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: promotion related handling in pg_sync_replication_slots()
Hi, On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote: > On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot > wrote: > > > > On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote: > > > On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot > > > wrote: > > > > > > > > On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote: > > > > > Now both worker and SQL > > > > > function set 'syncing' when they start and reset it when they exit. > > > > > > > > It means that it's not possible anymore to trigger a manual sync if > > > > sync_replication_slots is on. Indeed that would trigger: > > > > > > > > postgres=# select pg_sync_replication_slots(); > > > > ERROR: cannot synchronize replication slots concurrently > > > > > > > > That looks like an issue to me, thoughts? > > > > > > > > > > This is intentional as of now for the sake of keeping > > > implementation/code simple. It is not difficult to allow them but I am > > > not sure whether we want to add another set of conditions allowing > > > them in parallel. > > > > I think that the ability to launch a manual sync before a switchover would > > be > > missed. Except for this case I don't think that's an issue to prevent them > > to > > run in parallel. > > > > I think if the slotsync worker is available, it can do that as well. Right, but one has no control as to when the sync is triggered. > There is no clear use case for allowing them in parallel and I feel it > would add more confusion when it can work sometimes but not other > times. However, if we receive some report from the field where there > is a real demand for such a thing, it should be easy to achieve. For > example, I can imagine that we can have sync_state that has values > 'started', 'in_progress' , and 'finished'. This should allow us to > achieve what the current proposed patch is doing along with allowing > the API to work in parallel when the sync_state is not 'in_progress'. 
> > I think for now let's restrict their usage in parallel and make the > promotion behavior consistent both for worker and API. Okay, let's do it that way. Is it worth adding a few words in the doc related to pg_sync_replication_slots() though? (To mention it cannot be used while the slot sync worker is running.) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: promotion related handling in pg_sync_replication_slots()
Hi, On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote: > On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot > wrote: > > > > On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote: > > > Now both worker and SQL > > > function set 'syncing' when they start and reset it when they exit. > > > > It means that it's not possible anymore to trigger a manual sync if > > sync_replication_slots is on. Indeed that would trigger: > > > > postgres=# select pg_sync_replication_slots(); > > ERROR: cannot synchronize replication slots concurrently > > > > That looks like an issue to me, thoughts? > > > > This is intentional as of now for the sake of keeping > implementation/code simple. It is not difficult to allow them but I am > not sure whether we want to add another set of conditions allowing > them in parallel. I think that the ability to launch a manual sync before a switchover would be missed. Except for this case I don't think that's an issue to prevent them from running in parallel. > And that too in an unpredictable way as the API will > work only for the time slot sync worker is not performing the sync. Yeah, but then at least you would know that there is "really" a sync in progress (which is not the case currently with v4, as the sync worker being started is enough to prevent a manual sync even if a sync is not in progress). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: promotion related handling in pg_sync_replication_slots()
Hi, On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote: > On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila wrote: > > > > On Fri, Apr 12, 2024 at 5:25 PM shveta malik wrote: > > > > > > Please find v3 addressing race-condition and one other comment. > > > > > > Up-thread it was suggested that, probably, checking > > > SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On > > > re-thinking, it might not be. Slot sync worker sets and resets > > > 'syncing' with each sync-cycle, and thus we need to rely on worker's > > > pid in ShutDownSlotSync(), as there could be a window where promotion > > > is triggered and 'syncing' is not set for worker, while the worker is > > > still running. This implementation of setting and resetting syncing > > > with each sync-cycle looks better as compared to setting syncing > > > during the entire life-cycle of the worker. So, I did not change it. > > > > > > > To retain this we need to have different handling for 'syncing' for > > workers and function which seems like more maintenance burden than the > > value it provides. Moreover, in SyncReplicationSlots(), we are calling > > a function after acquiring spinlock which is not our usual coding > > practice. > > Okay. Changed it to consistent handling. Thanks for the patch! > Now both worker and SQL > function set 'syncing' when they start and reset it when they exit. It means that it's not possible anymore to trigger a manual sync if sync_replication_slots is on. Indeed that would trigger: postgres=# select pg_sync_replication_slots(); ERROR: cannot synchronize replication slots concurrently That looks like an issue to me, thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi, On Sat, Apr 06, 2024 at 10:13:00AM +0530, Amit Kapila wrote: > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot > wrote: > > I think the new LSN can be visible only when the corresponding WAL is > written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can > make it visible. In your experiment below, isn't it possible that in > the meantime WAL writer has written that WAL due to which you are > seeing an updated location? What I did is: session 1: select pg_current_wal_lsn();\watch 1 session 2: select pg_backend_pid(); terminal 1: tail -f logfile | grep -i snap terminal 2: gdb -p <pid returned by session 2>, stopping at standby.c:1346 "{", then (gdb) n to 1350. Then next, next until the DEBUG message is emitted (confirmed in terminal 1). At this stage the DEBUG message shows the new LSN while session 1 still displays the previous LSN. Then once XLogSetAsyncXactLSN() is done in the gdb session (terminal 2) then session 1 displays the new LSN. This is reproducible as desired. With more debugging I can see that when the spinlock is released in XLogSetAsyncXactLSN() then XLogWrite() is doing its job and then session 1 does see the new value (that happens in this order, and as you said that's expected). My point is that while the DEBUG message is emitted, session 1 still sees the old LSN (until the new LSN is visible). I think that we should emit the DEBUG message once session 1 can see the new value (if not, I think the timestamp of the DEBUG message can be misleading for debugging purposes). > I think I am missing how exactly moving DEBUG2 can confirm the above theory. I meant to say that instead of seeing: 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 739 next xid 740) 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/360' <= replay_lsn AND state = 'streaming' We would probably see something like: 2024-04-05 04:37:05.
UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/360' <= replay_lsn AND state = 'streaming' 2024-04-05 04:37:05.+xx UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 739 next xid 740) And then it would be clear that the query has run before the new LSN is visible. > > If the theory is proven then I'm not sure we need the extra complexity of > > injection point here, maybe just relying on the slots created via SQL API > > could > > be enough. > > > > Yeah, that could be the first step. We can probably add an injection > point to control the bgwrite behavior and then add tests involving > walsender performing the decoding. But I think it is important to have > sufficient tests in this area as I see they are quite helpful in > uncovering the issues. > Yeah, agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi, On Fri, Apr 05, 2024 at 02:35:42PM +, Bertrand Drouvot wrote: > I think that maybe as a first step we should move the "elog(DEBUG2," message > as > proposed above to help debugging (that could help to confirm the above > theory). If you agree and think that makes sense, please find attached a tiny patch doing so. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From a414e81a4c5c88963b2412d1641f3de1262c15c0 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 5 Apr 2024 14:49:51 + Subject: [PATCH v1] Move DEBUG message in LogCurrentRunningXacts() Indeed the new LSN is visible to other sessions (say through pg_current_wal_lsn()) only after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is released. So moving the DEBUG message after the XLogSetAsyncXactLSN() call seems more appropriate for debugging purposes. --- src/backend/storage/ipc/standby.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) 100.0% src/backend/storage/ipc/ diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c index 87b04e51b3..ba549acf50 100644 --- a/src/backend/storage/ipc/standby.c +++ b/src/backend/storage/ipc/standby.c @@ -1366,6 +1366,17 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts) recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS); + /* + * Ensure running_xacts information is synced to disk not too far in the + * future. We don't want to stall anything though (i.e. use XLogFlush()), + * so we let the wal writer do it during normal operation. + * XLogSetAsyncXactLSN() conveniently will mark the LSN as to-be-synced + * and nudge the WALWriter into action if sleeping. Check + * XLogBackgroundFlush() for details why a record might not be flushed + * without it.
+ */ + XLogSetAsyncXactLSN(recptr); + if (CurrRunningXacts->subxid_overflow) elog(DEBUG2, "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)", @@ -1383,17 +1394,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts) CurrRunningXacts->latestCompletedXid, CurrRunningXacts->nextXid); - /* - * Ensure running_xacts information is synced to disk not too far in the - * future. We don't want to stall anything though (i.e. use XLogFlush()), - * so we let the wal writer do it during normal operation. - * XLogSetAsyncXactLSN() conveniently will mark the LSN as to-be-synced - * and nudge the WALWriter into action if sleeping. Check - * XLogBackgroundFlush() for details why a record might not be flushed - * without it. - */ - XLogSetAsyncXactLSN(recptr); - return recptr; } -- 2.34.1
Re: Synchronizing slots from primary to standby
Hi, On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote: > On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila wrote: > Thinking more on this, it doesn't seem related to > c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't change > any locking or something like that which impacts write positions. Agree. > I think what has happened here is that running_xact record written by > the background writer [1] is not written to the kernel or disk (see > LogStandbySnapshot()), before pg_current_wal_lsn() checks the > current_lsn to be compared with replayed LSN. Agree, I think it's not visible through pg_current_wal_lsn() yet. Also I think that the DEBUG message in LogCurrentRunningXacts() " elog(DEBUG2, "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)", CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt, LSN_FORMAT_ARGS(recptr), CurrRunningXacts->oldestRunningXid, CurrRunningXacts->latestCompletedXid, CurrRunningXacts->nextXid); " should be located after the XLogSetAsyncXactLSN() call. 
Indeed, the new LSN is visible after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is released, see: \watch on Session 1 provides: pg_current_wal_lsn 0/87D110 (1 row) Until: Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at xlog.c:2579 2579 XLogRecPtr WriteRqstPtr = asyncXactLSN; (gdb) n 2581 bool wakeup = false; (gdb) 2584 SpinLockAcquire(&XLogCtl->info_lck); (gdb) 2585 RefreshXLogWriteResult(LogwrtResult); (gdb) 2586 sleeping = XLogCtl->WalWriterSleeping; (gdb) 2587 prevAsyncXactLSN = XLogCtl->asyncXactLSN; (gdb) 2588 if (XLogCtl->asyncXactLSN < asyncXactLSN) (gdb) 2589 XLogCtl->asyncXactLSN = asyncXactLSN; (gdb) 2590 SpinLockRelease(&XLogCtl->info_lck); (gdb) p/x (uint32) XLogCtl->asyncXactLSN $1 = 0x87d148 Then session 1 provides: pg_current_wal_lsn 0/87D148 (1 row) So, when we see in the log: 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 739 next xid 740) 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/360' <= replay_lsn AND state = 'streaming' It's indeed possible that the new LSN was not visible yet (spinlock not released?) before the query began (because we cannot rely on the time the DEBUG message has been emitted). > Note that the reason why > walsender has picked the running_xact written by background writer is > because it has checked after pg_current_wal_lsn() query, see LOGs [2]. > I think we can probably try to reproduce manually via debugger. > > If this theory is correct I think it is. > then I think we will need to use injection > points to control the behavior of bgwriter or use the slots created > via SQL API for syncing in tests. > > Thoughts? I think that maybe as a first step we should move the "elog(DEBUG2," message as proposed above to help debugging (that could help to confirm the above theory).
If the theory is proven then I'm not sure we need the extra complexity of injection point here, maybe just relying on the slots created via SQL API could be enough. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi, On Fri, Apr 05, 2024 at 04:09:01PM +0530, shveta malik wrote: > On Fri, Apr 5, 2024 at 10:09 AM Bertrand Drouvot > wrote: > > > > What about something like? > > > > ereport(LOG, > > errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs > > from remote slot", > > remote_slot->name), > > errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.", > > LSN_FORMAT_ARGS(remote_slot->restart_lsn), > > LSN_FORMAT_ARGS(slot->data.restart_lsn)); > > > > Regards, > > +1. Better than earlier. I will update and post the patch. > Thanks! BTW, I just realized that the LSNs I used in my example in the LSN_FORMAT_ARGS() calls are not the right ones. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi, On Fri, Apr 05, 2024 at 11:21:43AM +0530, Bharath Rupireddy wrote: > On Wed, Apr 3, 2024 at 9:57 PM Bertrand Drouvot > wrote: > Please find the attached v36 patch. Thanks! A few comments: 1 === + +The timeout is measured from the time since the slot has become +inactive (known from its +inactive_since value) until it gets +used (i.e., its active is set to true). + That's right except when it's invalidated during the checkpoint (as the slot is not acquired in CheckPointReplicationSlots()). So, what about adding: "or a checkpoint occurs"? That would also make clear that the invalidation can occur during a checkpoint. 2 === + /* If the slot has been invalidated, recalculate the resource limits */ + if (invalidated) + { /If the slot/If a slot/? 3 === + * NB - this function also runs as part of checkpoint, so avoid raising errors s/NB - this/NB: This function/? (that looks more consistent with other comments in the code) 4 === + * Note that having a new function for RS_INVAL_INACTIVE_TIMEOUT cause instead I understand it's "the RS_INVAL_INACTIVE_TIMEOUT cause" but reading "cause instead" looks weird to me. Maybe it would make sense to reword this a bit. 5 === +* considered not active as they don't actually perform logical decoding. Not sure that's 100% accurate, as we switched to fast-forward logical decoding in 2ec005b4e2. "as they perform only fast forward logical decoding (or not at all)", maybe? 6 === + if (RecoveryInProgress() && slot->data.synced) + return false; + + if (replication_slot_inactive_timeout == 0) + return false; What about just using one if? It's more a matter of taste but it also probably reduces the object file size a bit for a non-optimized build. 7 === + /* +* Do not invalidate the slots which are currently being synced from +* the primary to the standby. +*/ + if (RecoveryInProgress() && slot->data.synced) + return false; I think we don't need this check as the exact same one is done just before.
8 === +sub check_for_slot_invalidation_in_server_log +{ + my ($node, $slot_name, $offset) = @_; + my $invalidated = 0; + + for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++) + { + $node->safe_psql('postgres', "CHECKPOINT"); Wouldn't it be better to wait for the replication_slot_inactive_timeout time first instead of triggering all those checkpoints? (It could be passed as an extra arg to wait_for_slot_invalidation().) 9 === # Synced slot mustn't get invalidated on the standby, it must sync invalidation # from the primary. So, we must not see the slot's invalidation message in server # log. ok( !$standby1->log_contains( "invalidating obsolete replication slot \"lsub1_sync_slot\"", $standby1_logstart), 'check that syned slot has not been invalidated on the standby'); Would it make sense to trigger a checkpoint on the standby before this test? I mean I think that without a checkpoint on the standby we should not see the invalidation in the log anyway. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi, On Fri, Apr 05, 2024 at 09:43:35AM +0530, shveta malik wrote: > On Fri, Apr 5, 2024 at 9:22 AM Bertrand Drouvot > wrote: > > > > Hi, > > > > On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote: > > > On Thu, Apr 4, 2024 at 2:59 PM shveta malik > > > wrote: > > 2 === > > > > + if (slot->data.confirmed_flush != > > remote_slot->confirmed_lsn) + elog(LOG, +"could not synchronize local slot > > \"%s\" LSN(%X/%X)" +" to remote slot's LSN(%X/%X) ", +remote_slot->name, + > > LSN_FORMAT_ARGS(slot->data.confirmed_flush), + > > LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)); > > > > I don't think that the message is correct here. Unless I am missing > > something > > there is nothing in the following code path that would prevent the slot > > from being > > synced during this cycle. > > This is a sanity check, I will put a comment to indicate the same. Thanks! > We > want to ensure if anything changes in future, we get correct logs to > indicate that. Right, understood that way. > If required, the LOG msg can be changed. Kindly suggest if you have > anything better in mind. > What about something like? ereport(LOG, errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs from remote slot", remote_slot->name), errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.", LSN_FORMAT_ARGS(remote_slot->restart_lsn), LSN_FORMAT_ARGS(slot->data.restart_lsn))); Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi, On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote: > On Thu, Apr 4, 2024 at 2:59 PM shveta malik wrote: > > > > > > Prior to commit 2ec005b, this check was okay, as we did not expect > > restart_lsn of the synced slot to be ahead of remote since we were > > directly copying the lsns. But now when we use 'advance' to do logical > > decoding on standby, there is a possibility that restart lsn of the > > synced slot is ahead of remote slot, if there are running txns records > > found after reaching consistent-point while consuming WALs from > > restart_lsn till confirmed_lsn. In such a case, slot-sync's advance > > may end up serializing snapshots and setting restart_lsn to the > > serialized snapshot point, ahead of remote one. > > > > Fix: > > The sanity check needs to be corrected. Attached a patch to address the > > issue. > Thanks for reporting, explaining the issue and providing a patch. Regarding the patch: 1 === +* Attempt to sync lsns and xmins only if remote slot is ahead of local s/lsns/LSNs/? 2 === + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn) + elog(LOG, +"could not synchronize local slot \"%s\" LSN(%X/%X)" +" to remote slot's LSN(%X/%X) ", +remote_slot->name, + LSN_FORMAT_ARGS(slot->data.confirmed_flush), + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)); I don't think that the message is correct here. Unless I am missing something there is nothing in the following code path that would prevent the slot from being synced during this cycle. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Autogenerate some wait events code and documentation
Hi, On Thu, Apr 04, 2024 at 07:14:47PM +0900, Michael Paquier wrote: > On Thu, Apr 04, 2024 at 09:28:36AM +0000, Bertrand Drouvot wrote: > > +# No "Backpatch" region here as code is generated automatically. > > > > What about "region here as has its own C code" (that would be more > > consistent > > with the comment in the "header" for the file). Done that way in v4. > > I'd add a "as -this section- has its own C code", for clarity. This > just looked a bit strange here. Sounds good, done that way in v5 attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 1e24cdb02360147363495122fbfe79c2758ae4e8 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 4 Apr 2024 15:34:37 +0900 Subject: [PATCH v5] Add "ABI_compatibility" regions in wait_event_names.txt When backpatching, adding an event will renumber others, which can make an extension report the wrong event until recompiled. This is due to the fact that generate-wait_event_types.pl sorts events to position them. The "ABI_compatibility" region added here ensures no ordering is done for the wait events found after this delimiter. 
--- .../activity/generate-wait_event_types.pl | 27 ++- .../utils/activity/wait_event_names.txt | 17 2 files changed, 43 insertions(+), 1 deletion(-) 100.0% src/backend/utils/activity/ diff --git a/src/backend/utils/activity/generate-wait_event_types.pl b/src/backend/utils/activity/generate-wait_event_types.pl index f1adf0e8e7..9768406db5 100644 --- a/src/backend/utils/activity/generate-wait_event_types.pl +++ b/src/backend/utils/activity/generate-wait_event_types.pl @@ -38,7 +38,9 @@ die "Not possible to specify --docs and --code simultaneously" open my $wait_event_names, '<', $ARGV[0] or die; +my @abi_compatibility_lines; my @lines; +my $abi_compatibility = 0; my $section_name; my $note; my $note_name; @@ -59,10 +61,27 @@ while (<$wait_event_names>) { $section_name = $_; $section_name =~ s/^.*- //; + $abi_compatibility = 0; next; } - push(@lines, $section_name . "\t" . $_); + # ABI_compatibility region, preserving ABI compatibility of the code + # generated. Any wait events listed in this part of the file + # will not be sorted during the code generation. + if (/^ABI_compatibility:$/) + { + $abi_compatibility = 1; + next; + } + + if ($gen_code && $abi_compatibility) + { + push(@abi_compatibility_lines, $section_name . "\t" . $_); + } + else + { + push(@lines, $section_name . "\t" . $_); + } } # Sort the lines based on the second column. @@ -70,6 +89,12 @@ while (<$wait_event_names>) my @lines_sorted = sort { uc((split(/\t/, $a))[1]) cmp uc((split(/\t/, $b))[1]) } @lines; +# If we are generating code, concat @lines_sorted and then @abi_compatibility_lines. 
+if ($gen_code) +{ + push(@lines_sorted, @abi_compatibility_lines); +} + # Read the sorted lines and populate the hash table foreach my $line (@lines_sorted) { diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt index 0d288d6b3d..5169be238c 100644 --- a/src/backend/utils/activity/wait_event_names.txt +++ b/src/backend/utils/activity/wait_event_names.txt @@ -26,6 +26,12 @@ # When adding a new wait event, make sure it is placed in the appropriate # ClassName section. # +# Wait events added in stable branches should be appended to the lists in +# the "ABI_compatibility:" region of their related ClassName section to preserve +# ABI compatibility of the C code generated from this file's data, respecting +# the order of any wait event already listed there. The "ABI_compatibility:" +# regions should remain empty on the master branch and on unreleased branches. +# # WaitEventLWLock and WaitEventLock have their own C code for their wait event # enums and function names. Hence, for these, only the SGML documentation is # generated. @@ -61,6 +67,7 @@ WAL_SENDER_MAIN "Waiting in main loop of WAL sender process." WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated." WAL_WRITER_MAIN "Waiting in main loop of WAL writer process." +ABI_compatibility: # # Wait Events - Client @@ -83,6 +90,7 @@ WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process." WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process." +ABI_compatibility: # # Wait Events - IPC @@ -150,6 +158,7 @@ WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for st WAL_SUMMARY_READY "Waiting for a new WAL summary to b