On Thursday, February 15, 2024 10:49 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
> <bertranddrouvot...@gmail.com> wrote:
> >
> > On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote:
> > > On Wednesday, February 14, 2024 6:05 PM Amit Kapila
> > > <amit.kapil...@gmail.com> wrote:
> > > >
> > > > To ensure that restart_lsn has been moved to a recent position, we
> > > > need to log XLOG_RUNNING_XACTS and make sure the same is processed
> > > > as well by walsender. The attached patch does the required change.
> > > >
> > > > Hou-San can reproduce this problem by adding additional
> > > > checkpoints in the test and after applying the attached it fixes
> > > > the problem. Now, this patch is mostly based on the theory we
> > > > formed based on LOGs on BF and a reproducer by Hou-San, so still,
> > > > there is some chance that this doesn't fix the BF failures in which
> > > > case I'll again look into those.
> > >
> > > I have verified that the patch can fix the issue on my machine(after
> > > adding few more checkpoints before slot invalidation test.) I also
> > > added one more check in the test to confirm the synced slot is not temp
> > > slot. Here is the v2 patch.
> >
> > Thanks!
> >
> > +# To ensure that restart_lsn has moved to a recent WAL position, we need
> > +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well
> > +$primary->psql('postgres', "CHECKPOINT");
> >
> > Instead of "CHECKPOINT" wouldn't a less heavy "SELECT
> > pg_log_standby_snapshot();" be enough?
> >
>
> Yeah, that would be enough. However, the test still fails randomly due to the
> same reason. See [1]. So, as mentioned yesterday, now, I feel it is better to
> recreate the subscription/slot so that it can get the latest restart_lsn
> rather than relying on pg_log_standby_snapshot() to move it.
>
> > Not a big deal but maybe we could do the change while modifying
> > 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync
> > worker".
> >
>
> Right, we can do that or probably this test would have made more sense with a
> worker patch where we could wait for the slot to be synced.
> Anyway, let's try to recreate the slot/subscription idea. BTW, do you think
> that adding a LOG when we are not able to sync will help in debugging such
> problems? I think eventually we can change it to DEBUG1 but for now, it can
> help with stabilizing BF and or some other reported issues.
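
For reference, the two options discussed above might look roughly like this in plain SQL. This is only a sketch and is not taken from the attached patches: the subscription, publication, and slot names and the connection string are hypothetical, and it assumes a server version that supports the failover subscription option.

```sql
-- Option 1 (run on the primary): lighter than CHECKPOINT. It writes a
-- running-transactions snapshot (XLOG_RUNNING_XACTS) to WAL, which lets
-- restart_lsn advance once the walsender has processed the record.
SELECT pg_log_standby_snapshot();

-- Option 2 (run on the subscriber): re-create the subscription so that the
-- newly created slot starts from a recent restart_lsn.
-- demo_sub, demo_pub and demo_slot are hypothetical names.
DROP SUBSCRIPTION demo_sub;
CREATE SUBSCRIPTION demo_sub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION demo_pub
    WITH (slot_name = demo_slot, failover = true);
```

Option 2 does not depend on when the walsender gets around to processing the snapshot record, which seems to be why it is preferred for stabilizing the test.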
Here is a patch that attempts the idea of re-creating the subscription. I also
think a LOG/DEBUG message would be useful for this kind of analysis, so 0002
adds such a log.

Best Regards,
Hou zj
0002-Add-a-log-if-remote-slot-didn-t-catch-up-to-locally-.patch
0001-fix-BF-error-take-2.patch