Dear Vitaly,

> > Per my understanding, this happened because there is a lag that restart_lsn 
> > of
> > the slot is set, and it is protected by the system. Your idea is to ensure 
> > the
> > restart_lsn is protected by the system before obtaining on-memory LSN, 
> > right?
> 
> Not sure what you mean by on-memory LSN, but, the issue happens because we
> have
> a lag between restart_lsn assignment and update of
> XLogCtl->replicationSlotsMinLSN which is used to protect the WAL.

Sorry I should say "before obtaining replicationSlotMinLSN".

> Yes, I
> propose
> to ensure that the protection happens when we assign restart_lsn. It seems to 
> be
> wrong that we invalidate slots by its restart_lsn but protect the wal for
> slots using XLogCtl->replicationSlotsMinLSN.

Seems valid. There is another corner case that another restart_lsn can be set 
in-between,
but they have larger LSN than RedoRecPtr, right?

> Below I tried to write some summary and propose the patch which fixes the
> problem.

Sorry but it is too long to understand properly for me :-(.

> There is one subtle thing. Once, the operation of restart_lsn assignment is 
> not
> an atomic, the following scenario may happen theoretically:
> 1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
> 2. Assign a new redo LSN in the checkpointer
> 3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
> 3. Assign the old redo LSN to restart_lsn
> 
> In this scenario, the restart_lsn will point to a previous redo LSN and it 
> will
> be not protected by the new redo LSN. This scenario is unlikely, but it can
> happen theoretically. I have no ideas how to deal with it, except of assigning
> restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
> XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

Oh, your point is there is another race condition, right? Do you have the 
reproducer for it?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Reply via email to