On Fri, Jun 12, 2026 at 11:07 AM Matthias van de Meent < [email protected]> wrote:
> On Thu, 11 Jun 2026 at 19:25, Salma El-Sayed <[email protected]> > wrote: > ... > > VACUUM will be taught to ignore the contents of BTP_MERGED_AWAY pages. > > The entries inside are not live data, they are ghost copies cached for > > exactly one case: > > a backward scan that was positioned between R and L at the moment of the > merge. > > No other reader ever sees them. Forward scans see BTP_MERGED_AWAY and > > skip via the right link. > > New backward scans already read R (which now holds L's data) before > > arriving at L, so they skip L too. > > TID safety is guaranteed by MergeXID. The only scanner that can reach > > L's "ghost copies" is one whose snapshot predates MergeXID. MVCC > > guarantees that the heap rows those TIDs point to cannot be recycled > > while any transaction predating MergeXID is still active, because > > those rows are still visible to that transaction. > > I don't think that's accurate. The index is not guaranteed to just > contain live or recently-dead rows: it is almost certainly going to > contain references to rows which are already dead to every session. > This is what VACUUM removes when it goes through index cleanup with > its calls into index_bulk_delete. > > It is important to also note that it's not guaranteed that all current > dead rows are being cleaned up by a current index_bulk_delete() call - > if insufficient maintenance_work_mem was allocated, then only a subset > of currently dead rows will be included in the current > index_bulk_delete() call, and it's likely that cleanup will happen > again very soon. > > So, if you merge a page that contains references to dead rows (which > you can't necessarily know when merging), the BTP_MERGED_AWAY page > might therefore contain references to rows which are (soon) ALL_DEAD > and would be cleaned up in the next VACUUM scan. So, if the page is > BTP_MERGED_AWAY with dead tuples that have been removed everywhere > else but whose tuples aren't cleaned up by vacuum by principle, then > that's an issue for scanners that took very long to continue on to > that page. > Allow me to step in on this part of the conversation. BTP_MERGED_AWAY is mostly a FLAG page. That is really like a pre- BTP_HALF_DEAD page. NOBODY should be interested in the contents. Especially Vacuum. No NORMAL scans should read those contents. We are leaving the contents on this "L" page (of L,R, and L+R notation), for the 2 VERY specific Edge cases. 1) FWD Scans that already read "L", and finds the "L+R" instead of R. (it was caught in the middle due to locks, etc). It "knows" there are ghost copies left behind on "L" (BTP_MERGED_AWAY), and it goes back, reads them, and filters them out of the L+R results it just processed so the L records are NOT duplicated. 2) BWD Scans that read "R" before the merge, then finds "L" (BTP_MERGED_AWAY). It knows it did NOT read the L+R page. So, like the previous logic, it can safely read the "L" records. outside of those two cases, which are clearly edge cases. NOBODY should see those "ghost" records. We are using them just to make the algorithm clean and efficient without having to add something elsewhere to handle the fixups. Back to the question of VACUUM. And where your insight might be trying to help us from making a terrible mistake. My understanding is that there can be MANY "extra" references in any index to records that have been LONG deleted. And when the Table is read, those are removed from the result set (I believe it's why index-only scans are limited to mostly clean tables, freshly vacuumed, etc). So, our assumption is that, if we envision a RACE condition, where VACUUM is trying to remove the INDEX references. BUT our scan is already past that BLOCK. Our scan does not restart, know or care that vacuum wants to remove them from the index (there is no "fixup" process required for a scan). And that's our argument for keeping VACUUM ignorant of anything on BTP_MERGED_AWAY. Because NO "normal" scan will read those contents. Only one that was likely blocked by our activity, and needs to "fix itself" (to avoid dupes or skips). Any scan that runs after our lock doesn't care. If they ran before our lock, and finished, they don't care. The 2 edge cases care... Because it causes Skips (FWD scan), and Dupes (BWD scan), otherwise. We even avoid using BTP_HALF_DEAD in our case, to "keep this flag" for the edge cases. Technically if we could KNOW there are no more edge cases alive, we could have VACUUM always force this BTP_MERGED_AWAY into BTP_HALF_DEAD status. That was our original thought process. But you see that we cannot know we have an edge case... Until... No queries could possibly see past our transaction marker. Now, it's at this point that I wonder if we can actually just DELETE the BTP_MERGED_AWAY page, by VACUUM. But what we prefer, is that it gets marked half dead, and the TXNID_Block_No is used to flip the BTP_MERGED page(s) that follow (because there could have been page splits). And all of those pages that are past that horizon, WILL BE flipped back to NORMAL pages (from BTP_MERGED). For now, for testing and validation... We want to get it working first, and have it take a few steps so we can watch the table "self-recover" from the merge with the Autovacuum. This is why you will hear Salma defend vacuum "not looking" at the old "L" records, certainly not changing them. They are REQUIRED to be the snapshot for what was merged... Only for the 2 edge cases. Thanks for the comments. And for the record, I am one of the mentors on this GSoC for Salma. (Today's lesson for her, is how to defend your position, while being prepared for the public beat down (lol) where you might be wrong. Be wrong out loud! You learn faster!) Kirk
