On Thursday, February 15, 2024 10:49 AM Amit Kapila <amit.kapil...@gmail.com> 
wrote:
> 
> On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
> <bertranddrouvot...@gmail.com> wrote:
> >
> > On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote:
> > > On Wednesday, February 14, 2024 6:05 PM Amit Kapila
> <amit.kapil...@gmail.com> wrote:
> > > >
> > > > To ensure that restart_lsn has been moved to a recent position, we
> > > > need to log XLOG_RUNNING_XACTS and make sure the same is processed
> > > > as well by walsender. The attached patch does the required change.
> > > >
> > > > Hou-San can reproduce this problem by adding additional
> > > > checkpoints in the test and after applying the attached it fixes
> > > > the problem. Now, this patch is mostly based on the theory we
> > > > formed based on LOGs on BF and a reproducer by Hou-San, so still,
> > > > there is some chance that this doesn't fix the BF failures in which 
> > > > case I'll
> again look into those.
> > >
> > > I have verified that the patch can fix the issue on my machine(after
> > > adding few more checkpoints before slot invalidation test.) I also
> > > added one more check in the test to confirm the synced slot is not temp 
> > > slot.
> Here is the v2 patch.
> >
> > Thanks!
> >
> > +# To ensure that restart_lsn has moved to a recent WAL position, we
> > +need # to log XLOG_RUNNING_XACTS and make sure the same is processed
> > +as well $primary->psql('postgres', "CHECKPOINT");
> >
> > Instead of "CHECKPOINT" wouldn't a less heavy "SELECT
> pg_log_standby_snapshot();"
> > be enough?
> >
> 
> Yeah, that would be enough. However, the test still fails randomly due to the
> same reason. See [1]. So, as mentioned yesterday, now, I feel it is better to
> recreate the subscription/slot so that it can get the latest restart_lsn 
> rather than
> relying on pg_log_standby_snapshot() to move it.
> 
> > Not a big deal but maybe we could do the change while modifying
> > 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync
> worker".
> >
> 
> Right, we can do that or probably this test would have made more sense with a
> worker patch where we could wait for the slot to be synced.
> Anyway, let's try to recreate the slot/subscription idea. BTW, do you think 
> that
> adding a LOG when we are not able to sync will help in debugging such
> problems? I think eventually we can change it to DEBUG1 but for now, it can 
> help
> with stabilizing BF and or some other reported issues.

Here is the patch that attempts the re-create sub idea. I also think that a 
LOG/DEBUG
would be useful for such analysis, so the 0002 is to add such a log.

Best Regards,
Hou zj

Attachment: 0002-Add-a-log-if-remote-slot-didn-t-catch-up-to-locally-.patch
Description: 0002-Add-a-log-if-remote-slot-didn-t-catch-up-to-locally-.patch

Attachment: 0001-fix-BF-error-take-2.patch
Description: 0001-fix-BF-error-take-2.patch

Reply via email to