Hi,

On Thu, Mar 28, 2024 at 04:38:19AM +0000, Zhijie Hou (Fujitsu) wrote:
> Hi,
>
> When analyzing one BF error[1], we find an issue of slotsync: Since we don't
> perform logical decoding for the synced slots when syncing the lsn/xmin of
> slot, no logical snapshots will be serialized to disk. So, when user starts to
> use these synced slots after promotion, it needs to re-build the consistent
> snapshot from the restart_lsn if the WAL(xl_running_xacts) at restart_lsn
> position indicates that there are running transactions. This however could
> cause the data that before the consistent point to be missed[2].

I see, nice catch and explanation, thanks!

> This issue doesn't exist on the primary because the snapshot at restart_lsn
> should have been serialized to disk (SnapBuildProcessRunningXacts ->
> SnapBuildSerialize), so even if the logical decoding restarts, it can find
> consistent snapshot immediately at restart_lsn.

Right.

> To fix this, we could use the fast forward logical decoding to advance the
> synced slot's lsn/xmin when syncing these values instead of directly updating
> the slot's info. This way, the snapshot will be serialized to disk when decoding.
> If we could not reach to the consistent point at the remote restart_lsn, the
> slot is marked as temp and will be persisted once it reaches the consistent
> point. I am still analyzing the fix and will share once ready.

Thanks! I'm wondering about the performance impact (even in fast_forward mode);
it might be worth keeping an eye on it.

Should we create a 17 open item [1]?

[1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com