Hi,

On Tue, Oct 14, 2025 at 04:06:44PM -0700, Masahiko Sawada wrote:
> On Sun, Oct 12, 2025 at 11:27 PM Bertrand Drouvot
> <[email protected]> wrote:
> >
> > Hi,
> >
> > On Thu, Oct 09, 2025 at 10:49:39AM +0800, suyu.cmj wrote:
> > > Hi,
> > > Thank you for the reference to commit 818fefd8fd4 and the related 
> > > discussion thread. I understand the intent of introducing 
> > > initial_restart_lsn was to preserve a consistent invalidation cause 
> > > throughout the invalidation loop.
> > > However, I still have a few concerns about this design change:
> > > 1. I understand the intention to keep the invalidation cause consistent, 
> > > but If a slot's restart_lsn advances significantly during the 
> > > invalidation check—indicating it is actively in use—shouldn't we 
> > > reconsider invalidating it?
> > > 2. What potential issues arise if we refrain from invalidating slots 
> > > whose restart_lsn advances during the invalidation process? Intuitively, 
> > > an actively used slot that has moved it's restart_lsn beyond the 
> > > problematic point should not be marked invalid.
> > > 3. If the current approach is indeed correct, should we consider making 
> > > PG15 and earlier consistent with this behavior? The behavioral difference 
> > > across versions may lead to different operational outcomes in otherwise 
> > > similar situations.
> > > I would appreciate your insights on these points.
> >
> > I agree that before 818fefd8fd4 the invalidation cause could move from
> > RS_INVAL_WAL_REMOVED to RS_INVAL_NONE if the slot restart lsn has been able 
> > to
> > advance enough between the time we release the mutex and do the next check.
> >
> > With 818fefd8fd4 that's not the case anymore and we keep WAL_REMOVED as the
> > invalidation cause (even if  the slot restart lsn has been able to advance
> > enough).
> >
> > That looks safe to use the pre 818fefd8fd4 behavior for the slot restart lsn
> > case because the WAL files have not yet been removed by the 
> > checkpointer/startup
> > process when it's busy in InvalidatePossiblyObsoleteSlot().
> >
> > I think that we could get rid of the initial_restart_lsn and just use
> > s->data.restart_lsn here (while keeping initial xmin ones to preserve the
> > intent of 818fefd8fd4 for those).
> 
> IIUC

Thanks for looking at it!

> with the proposed patch, it's possible that we report the slot
> invalidation once but don't actually invalidate the slot if slot's
> restart_lsn gets advanced and becomes greater than the oldestLSN after
> the report, is that right?

We don't really report an "invalidation", what we report is:

LOG:  terminating process 3998707 to release replication slot "logical_slot"
DETAIL:  The slot's restart_lsn 0/00842480 exceeds the limit by 2874240 bytes.
HINT:  You might need to increase "max_slot_wal_keep_size".

and we terminate the process:

FATAL:  terminating connection due to administrator command

We are not reporting:

DETAIL:  This replication slot has been invalidated due to "wal_removed".

and the slot is still valid.

That's the pre 818fefd8fd4 behavior.

Ideally, I think that we should not report anything and not terminate the
process. I did not look at it, maybe we could look at it as a second step (first
step being to restore the pre 818fefd8fd4 behavior)?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Reply via email to