On Tue, Oct 14, 2025 at 9:51 PM Bertrand Drouvot <[email protected]> wrote: > > Hi, > > On Tue, Oct 14, 2025 at 04:06:44PM -0700, Masahiko Sawada wrote: > > On Sun, Oct 12, 2025 at 11:27 PM Bertrand Drouvot > > <[email protected]> wrote: > > > > > > Hi, > > > > > > On Thu, Oct 09, 2025 at 10:49:39AM +0800, suyu.cmj wrote: > > > > Hi, > > > > Thank you for the reference to commit 818fefd8fd4 and the related > > > > discussion thread. I understand the intent of introducing > > > > initial_restart_lsn was to preserve a consistent invalidation cause > > > > throughout the invalidation loop. > > > > However, I still have a few concerns about this design change: > > > > 1. I understand the intention to keep the invalidation cause > > > > consistent, but If a slot's restart_lsn advances significantly during > > > > the invalidation check—indicating it is actively in use—shouldn't we > > > > reconsider invalidating it? > > > > 2. What potential issues arise if we refrain from invalidating slots > > > > whose restart_lsn advances during the invalidation process? > > > > Intuitively, an actively used slot that has moved it's restart_lsn > > > > beyond the problematic point should not be marked invalid. > > > > 3. If the current approach is indeed correct, should we consider making > > > > PG15 and earlier consistent with this behavior? The behavioral > > > > difference across versions may lead to different operational outcomes > > > > in otherwise similar situations. > > > > I would appreciate your insights on these points. > > > > > > I agree that before 818fefd8fd4 the invalidation cause could move from > > > RS_INVAL_WAL_REMOVED to RS_INVAL_NONE if the slot restart lsn has been > > > able to > > > advance enough between the time we release the mutex and do the next > > > check. > > > > > > With 818fefd8fd4 that's not the case anymore and we keep WAL_REMOVED as > > > the > > > invalidation cause (even if the slot restart lsn has been able to advance > > > enough). > > > > > > That looks safe to use the pre 818fefd8fd4 behavior for the slot restart > > > lsn > > > case because the WAL files have not yet been removed by the > > > checkpointer/startup > > > process when it's busy in InvalidatePossiblyObsoleteSlot(). > > > > > > I think that we could get rid of the initial_restart_lsn and just use > > > s->data.restart_lsn here (while keeping initial xmin ones to preserve the > > > intent of 818fefd8fd4 for those). > > > > IIUC > > Thanks for looking at it! > > > with the proposed patch, it's possible that we report the slot > > invalidation once but don't actually invalidate the slot if slot's > > restart_lsn gets advanced and becomes greater than the oldestLSN after > > the report, is that right? > > We don't really report an "invalidation", what we report is: > > LOG: terminating process 3998707 to release replication slot "logical_slot" > DETAIL: The slot's restart_lsn 0/00842480 exceeds the limit by 2874240 bytes. > HINT: You might need to increase "max_slot_wal_keep_size". > > and we terminate the process: > > FATAL: terminating connection due to administrator command > > We are not reporting: > > DETAIL: This replication slot has been invalidated due to "wal_removed". > > and the slot is still valid. > > That's the pre 818fefd8fd4 behavior.
Thank you for the clarification! Understood. > Ideally, I think that we should not report anything and not terminate the > process. I did not look at it, maybe we could look at it as a second step > (first > step being to restore the pre 818fefd8fd4 behavior)? I find that reporting of terminating a process having an possibly obsolete slot is fine, but reading some related threads[1][2] it seems to me that a problem we want to avoid is that we report "terminated" without leading to an "obsolete" message. Does it make sense to report explicitly that the slot's restart_lsn gets recovered and we therefore skipped to invalidate it? Regards, [1] https://www.postgresql.org/message-id/flat/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal [2] https://www.postgresql.org/message-id/[email protected] -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
