On Thu, Jan 11, 2024 at 4:35 PM vignesh C <vignes...@gmail.com> wrote:
>
> On further analysis, it was found that in the failing test,
> CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL
> page header present just before the CHECKPOINT_SHUTDOWN which was
> causing the failure. We could alternatively reproduce the issue by
> switching the WAL file before restarting the server like in the
> attached test change patch.
> There are a couple of ways to fix this issue a) one by switching the
> WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN
> does not get inserted in a new page as in the attached test_fix.patch
> b) by using pg_walinspect to check that the next WAL record is
> CHECKPOINT_SHUTDOWN. I have to try this approach.
>
> Thanks to Bharath and Kuroda-san for offline discussions and helping
> in getting to the root cause.

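To make the failure mode in the quoted analysis concrete, here is a rough
Python sketch of the LSN arithmetic as I understand it. It is only a toy
model, not PostgreSQL code: the helper names (parse_lsn, next_record_start,
slot_caught_up) are made up for illustration, the LSN values are invented,
and the page size, segment size and page header sizes are assumed defaults
that can differ per build.

```python
# Toy model of why a plain "confirmed_flush == checkpoint LSN" comparison
# can fail when the shutdown checkpoint record starts on a new WAL page.
# All constants are assumptions (default 8 kB pages, 16 MB segments,
# header sizes as on a typical 64-bit build).
XLOG_BLCKSZ = 8192
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SHORT_PAGE_HEADER = 24  # assumed size of a regular page header
LONG_PAGE_HEADER = 40   # assumed size of the first page header in a segment


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'hi/lo' hex form (e.g. '0/1505D88') to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def next_record_start(end_of_prev: int) -> int:
    """Return where the next record would begin after one ending at end_of_prev.

    If the previous record ends exactly on a page boundary, the next record
    begins after that page's header; otherwise it begins right where the
    previous record ended.
    """
    if end_of_prev % XLOG_BLCKSZ != 0:
        return end_of_prev
    if end_of_prev % WAL_SEGMENT_SIZE == 0:
        return end_of_prev + LONG_PAGE_HEADER
    return end_of_prev + SHORT_PAGE_HEADER


def slot_caught_up(confirmed_flush: str, checkpoint_lsn: str) -> bool:
    """True if nothing but an optional page header separates the two LSNs."""
    return next_record_start(parse_lsn(confirmed_flush)) == parse_lsn(checkpoint_lsn)
```

With these assumptions, slot_caught_up("0/1508000", "0/1508018") is true
even though the two LSNs differ, because the 0x18 bytes in between are
only the page header. A naive equality check would report a mismatch in
exactly that case.
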
IIUC, the problem that commit e0b2eed tries to solve is to ensure that
no decodable WAL records are left over between the confirmed_flush LSN
and the shutdown checkpoint, which is what
t/038_save_logical_slots_shutdown.pl is expected to verify.

How about we have a PG function that returns true if there are any
decodable WAL records between the given start_lsn and end_lsn? Using
this new function would make the tests more concrete and stable. The
function doesn't have to be something really new; we can just turn
binary_upgrade_logical_slot_has_caught_up into a general,
non-binary-upgrade PG function. This idea has come up before:
https://www.postgresql.org/message-id/CAA4eK1KZXaBgVOAdV8ZfG6AdDbKYFVz7teDa7GORgQ3RVYS93g%40mail.gmail.com

If okay, I can offer to write a patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com