On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot <bertranddrouvot...@gmail.com> wrote:
>
> On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote:
> > On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > > To ensure that restart_lsn has been moved to a recent position, we need to log
> > > XLOG_RUNNING_XACTS and make sure the same is processed as well by
> > > walsender. The attached patch does the required change.
> > >
> > > Hou-San can reproduce this problem by adding additional checkpoints in the
> > > test and after applying the attached it fixes the problem. Now, this patch is
> > > mostly based on the theory we formed based on LOGs on BF and a reproducer
> > > by Hou-San, so still, there is some chance that this doesn't fix the BF
> > > failures in which case I'll again look into those.
> >
> > I have verified that the patch can fix the issue on my machine(after adding few
> > more checkpoints before slot invalidation test.) I also added one more check in
> > the test to confirm the synced slot is not temp slot. Here is the v2 patch.
>
> Thanks!
>
> +# To ensure that restart_lsn has moved to a recent WAL position, we need
> +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well
> +$primary->psql('postgres', "CHECKPOINT");
>
> Instead of "CHECKPOINT" wouldn't a less heavy "SELECT pg_log_standby_snapshot();"
> be enough?
>
Yeah, that would be enough. However, the test still fails randomly for
the same reason. See [1]. So, as mentioned yesterday, I now feel it is
better to recreate the subscription/slot so that it can get the latest
restart_lsn rather than relying on pg_log_standby_snapshot() to move it.

> Not a big deal but maybe we could do the change while modifying
> 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync
> worker".
>

Right, we can do that, or probably this test would have made more sense
with a worker patch where we could wait for the slot to be synced.
Anyway, let's try the idea of recreating the slot/subscription. BTW, do
you think that adding a LOG when we are not able to sync will help in
debugging such problems? I think eventually we can change it to DEBUG1,
but for now it can help with stabilizing BF and/or some other reported
issues.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-15%2000%3A14%3A38

--
With Regards,
Amit Kapila.