On Fri, Nov 5, 2021 at 3:26 AM Andrey Borodin <x4...@yandex-team.ru> wrote:
> > On Nov 4, 2021, at 20:58, Peter Geoghegan <p...@bowt.ie> wrote:
> > That's a pretty unlikely scenario. And even
> > if it happened it would only happen once (until the next time we get
> > unlucky). What are the chances of somebody noticing a more or less
> > once-off, slightly wrong answer?
>
> I'd say next to impossible, yet not impossible. Or perhaps I just
> don't see the protection against this.

I think that that's probably all correct -- I would certainly make the
same guess. It's very unlikely to happen, and when it does happen it
happens only once.

> Moreover, there's "microvacuum", which kills tuples with only
> BUFFER_LOCK_SHARE. AFAIU it should take a cleanup lock on the buffer
> too?

No, because there is no heap vacuuming involved (that only ever happens
inside vacuumlazy.c). The work that VACUUM does inside
lazy_vacuum_heap_rel() is part of the problem here -- we need an
interlock between that work and index-only scans. Marking LP_DEAD
items in heap pages LP_UNUSED is only ever possible during a VACUUM
operation (I'm sure you know why). AFAICT there would be no bug at all
without that detail.
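
To make the dependency concrete, here is a minimal sketch of the
interlock that an index AM supporting index-only scans is expected to
provide. This is not actual PostgreSQL code -- the function and its
parameters are made up for illustration; only the buffer manager and
visibility map calls are real:

    #include "postgres.h"
    #include "access/visibilitymap.h"
    #include "storage/bufmgr.h"
    #include "storage/itemptr.h"

    /*
     * Illustrative sketch: decide whether an index-only scan may return
     * an index tuple without visiting the heap.  Caller still holds a
     * pin (but no lock) on the leaf page the TID was copied from.
     */
    static bool
    index_only_scan_tid_all_visible(Relation heapRel, ItemPointer tid,
                                    Buffer leafbuf, Buffer *vmbuf)
    {
        Assert(BufferIsValid(leafbuf)); /* leaf page pin still held */

        /*
         * Safe to trust the VM bit: before VACUUM can set it in
         * lazy_vacuum_heap_page() (right after marking LP_DEAD items
         * LP_UNUSED), btbulkdelete must first get a cleanup lock on
         * leafbuf -- which our pin prevents until we're done here.
         */
        return VM_ALL_VISIBLE(heapRel, ItemPointerGetBlockNumber(tid),
                              vmbuf);
    }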

I believe that there have been several historic reasons why we need a
cleanup lock during nbtree VACUUM, and that only one of them remains
today -- so the history is unusually complicated. But AFAICT the
remaining reason is always some kind of "interlock with heapam VACUUM"
issue: heap TID recycling that our MVCC snapshot gives us no protection
against. I would say that that's the "real problem" here, going back to
first principles.

The attached draft patch attempts to explain things in this area within
the nbtree README; it also adds a much shorter comment about the issue
to vacuumlazy.c. I am concerned about GiST index-only scans themselves,
of course, but I discovered this issue when thinking carefully about
the concurrency rules for VACUUM -- I think it's valuable to formalize
and justify the general rules that index access methods must follow.
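
As one concrete example of such a rule: an index scan that drops its
leaf page pin may only set LP_DEAD bits later on if the page's LSN is
unchanged from when the page was read (this is what nbtree does in
_bt_killitems()). Here's a hedged sketch of just that decision -- the
function and parameter names are illustrative, not real PostgreSQL
functions:

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    /*
     * Illustrative sketch: may we still set LP_DEAD bits for index
     * tuples remembered from an earlier read of this page?  Caller has
     * re-locked the buffer in shared mode.
     */
    static bool
    safe_to_set_lp_dead_bits(Buffer buf, XLogRecPtr lsn_at_read,
                             bool dropped_pin)
    {
        /*
         * If the pin was held throughout, VACUUM can never have taken
         * the cleanup lock it needs before recycling our TIDs -- safe,
         * no matter what else changed on the page in the meantime.
         */
        if (!dropped_pin)
            return true;

        /*
         * Otherwise any concurrent change (including VACUUM deleting
         * index tuples) will have advanced the page LSN; an unchanged
         * LSN proves that no remembered TID can have been recycled.
         */
        return PageGetLSN(BufferGetPage(buf)) == lsn_at_read;
    }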

We can talk about this some more in NYC. See you soon!
--
Peter Geoghegan
From ea6612300e010f1f2b02119b5a0de95e46d1157d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 3 Nov 2021 14:38:15 -0700
Subject: [PATCH v1] nbtree README: Improve VACUUM interlock section.

Also document a related issue for index-only scans in vacuumlazy.c.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5qxWQ@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c |  10 ++
 src/backend/access/nbtree/README     | 145 ++++++++++++---------------
 2 files changed, 75 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 282b44f87..8bfe196bf 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2384,6 +2384,16 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  * LP_DEAD items on the page that were determined to be LP_DEAD items back
  * when the same page was visited by lazy_scan_prune() (i.e. those whose TID
  * was recorded in the dead_items array at the time).
+ *
+ * We can opportunistically set the visibility map bit for the page here,
+ * which is valuable when lazy_scan_prune couldn't do so earlier, solely
+ * because the page had LP_DEAD items that we'll only now mark LP_UNUSED.
+ * This is why index AMs that support index-only scans must hold a pin on
+ * an index page as an interlock against VACUUM while accessing the
+ * visibility map (which is really just a dense summary of visibility
+ * information in the heap).  If they didn't, there would be rare race
+ * conditions where a heap TID that is actually dead appears alive to an
+ * unlucky index-only scan.
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 2a7332d07..c6f04d856 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -89,25 +89,28 @@ Page read locks are held only for as long as a scan is examining a page.
 To minimize lock/unlock traffic, an index scan always searches a leaf page
 to identify all the matching items at once, copying their heap tuple IDs
 into backend-local storage.  The heap tuple IDs are then processed while
-not holding any page lock within the index.  We do continue to hold a pin
-on the leaf page in some circumstances, to protect against concurrent
-deletions (see below).  In this state the scan is effectively stopped
-"between" pages, either before or after the page it has pinned.  This is
-safe in the presence of concurrent insertions and even page splits, because
-items are never moved across pre-existing page boundaries --- so the scan
-cannot miss any items it should have seen, nor accidentally return the same
-item twice.  The scan must remember the page's right-link at the time it
-was scanned, since that is the page to move right to; if we move right to
-the current right-link then we'd re-scan any items moved by a page split.
-We don't similarly remember the left-link, since it's best to use the most
-up-to-date left-link when trying to move left (see detailed move-left
-algorithm below).
+not holding any page lock within the index.  Plain indexscans can opt to
+hold a pin on the leaf page, to protect against concurrent heap TID
+recycling by VACUUM, but that has nothing to do with the B-Tree physical
+data structure itself.  See the "VACUUM's super-exclusive lock" section
+below for more information.
 
-In most cases we release our lock and pin on a page before attempting
-to acquire pin and lock on the page we are moving to.  In a few places
-it is necessary to lock the next page before releasing the current one.
-This is safe when moving right or up, but not when moving left or down
-(else we'd create the possibility of deadlocks).
+When an index scan finishes processing a leaf page, it has effectively
+stopped "between" pages.  This is safe in the presence of concurrent
+insertions and even page splits, because items are never moved across
+pre-existing page boundaries --- so the scan cannot miss any items it
+should have seen, nor accidentally return the same item twice.  The scan
+must remember the page's right-link at the time it is read for all this
+to work, since that is the page to visit next if the scan needs to
+continue.  Things are more complicated for backwards scans, though --
+see the relevant section below.
+
+In most cases we release our lock on a page before attempting to acquire
+a lock on the sibling page we are moving to.  In a few places (reached
+only during inserts or VACUUM) it might be necessary to lock the next
+page before releasing the lock on the current page.  This is safe when
+moving right or up, but not when moving left or down (else we'd create
+the possibility of deadlocks).
 
 Lehman and Yao fail to discuss what must happen when the root page
 becomes full and must be split.  Our implementation is to split the
@@ -163,56 +166,44 @@ pages (though suffix truncation is also considered).  Note we must include
 the incoming item in this calculation, otherwise it is possible to find
 that the incoming item doesn't fit on the split page where it needs to go!
 
-Deleting index tuples during VACUUM
------------------------------------
+VACUUM's super-exclusive lock, unsafe concurrent heap TID recycling
+-------------------------------------------------------------------
 
-Before deleting a leaf item, we get a super-exclusive lock on the target
-page, so that no other backend has a pin on the page when the deletion
-starts.  This is not necessary for correctness in terms of the btree index
-operations themselves; as explained above, index scans logically stop
-"between" pages and so can't lose their place.  The reason we do it is to
-provide an interlock between VACUUM and indexscans.  Since VACUUM deletes
-index entries before reclaiming heap tuple line pointers, the
-super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
-line pointer that an indexscanning process might be about to visit.  This
-guarantee works only for simple indexscans that visit the heap in sync
-with the index scan, not for bitmap scans.  We only need the guarantee
-when using non-MVCC snapshot rules; when using an MVCC snapshot, it
-doesn't matter if the heap tuple is replaced with an unrelated tuple at
-the same TID, because the new tuple won't be visible to our scan anyway.
-Therefore, a scan using an MVCC snapshot which has no other confounding
-factors will not hold the pin after the page contents are read.  The
-current reasons for exceptions, where a pin is still needed, are if the
-index is not WAL-logged or if the scan is an index-only scan.  If later
-work allows the pin to be dropped for all cases we will be able to
-simplify the vacuum code, since the concept of a super-exclusive lock
-for btree indexes will no longer be needed.
+Before deleting items from leaf pages, VACUUM gets a super-exclusive
+lock on the target page, so that no other backend has a pin on the page
+when the deletion starts.  This is not necessary for correctness in
+terms of the btree index operations themselves; as explained above,
+index scans logically stop "between" pages and so can't lose their
+place.  Rather, it's needed to prevent unsafe concurrent recycling of
+heap TIDs.
 
-Because a pin is not always held, and a page can be split even while
-someone does hold a pin on it, it is possible that an indexscan will
-return items that are no longer stored on the page it has a pin on, but
-rather somewhere to the right of that page.  To ensure that VACUUM can't
-prematurely remove such heap tuples, we require btbulkdelete to obtain a
-super-exclusive lock on every leaf page in the index, even pages that
-don't contain any deletable tuples.  Any scan which could yield incorrect
-results if the tuple at a TID matching the scan's range and filter
-conditions were replaced by a different tuple while the scan is in
-progress must hold the pin on each index page until all index entries read
-from the page have been processed.  This guarantees that the btbulkdelete
-call cannot return while any indexscan is still holding a copy of a
-deleted index tuple if the scan could be confused by that.  Note that this
-requirement does not say that btbulkdelete must visit the pages in any
-particular order.  (See also simple deletion and bottom-up deletion,
-below.)
+Requiring super-exclusive locks in nbtree VACUUM enables interlocking
+between heap vacuuming (where VACUUM recycles heap TIDs) and index scans
+that visit the heap "in sync".  Since VACUUM (but not simple deletion or
+bottom-up deletion) always removes index tuples before recycling heap
+line pointers (after btbulkdelete returns, during the second pass over
+the heap), the super-exclusive lock guarantees that VACUUM can't
+concurrently recycle a heap TID that a plain index scan might still need
+to visit.  Note that VACUUM must do this for every leaf page, not just
+those that are known to have index tuples that must be removed
+(optimizing away the super-exclusive lock would be wrong, since we're
+not concerned about the physical leaf page itself; the scan has already
+finished reading the leaf page at the point that it begins visiting its
+heap TIDs).
 
-There is no such interlocking for deletion of items in internal pages,
-since backends keep no lock nor pin on a page they have descended past.
-Hence, when a backend is ascending the tree using its stack, it must
-be prepared for the possibility that the item it wants is to the left of
-the recorded position (but it can't have moved left out of the recorded
-page).  Since we hold a lock on the lower page (per L&Y) until we have
-re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.
+The interlock only applies when an index scan opts in by holding on to a
+pin on each just-read leaf page (until the scan is done visiting TIDs
+found on the leaf page in the heap).  Most index scans just drop the pin
+instead, which is generally preferable because it avoids unnecessarily
+blocking index vacuuming.  Only index scans using non-MVCC snapshots
+really need the interlock, because they cannot just depend on MVCC rules
+to avoid returning unrelated heap tuples that happened to reuse the
+original heap line pointer.  (Actually, certain implementation
+restrictions that affect the kill_prior_tuple/LP_DEAD optimization also
+affect whether or not we hold a pin here, even with an MVCC snapshot;
+see the simple deletion section below.  This doesn't change the fact
+that holding on to a pin is fundamentally optional for index scans
+that use an MVCC snapshot.)
 
 VACUUM's linear scan, concurrent page splits
 --------------------------------------------
@@ -518,21 +509,15 @@ that's required for the deletion process to perform granular removal of
 groups of dead TIDs from posting list tuples (without the situation ever
 being allowed to get out of hand).
 
-It's sufficient to have an exclusive lock on the index page, not a
-super-exclusive lock, to do deletion of LP_DEAD items.  It might seem
-that this breaks the interlock between VACUUM and indexscans, but that is
-not so: as long as an indexscanning process has a pin on the page where
-the index item used to be, VACUUM cannot complete its btbulkdelete scan
-and so cannot remove the heap tuple.  This is another reason why
-btbulkdelete has to get a super-exclusive lock on every leaf page, not only
-the ones where it actually sees items to delete.
-
-LP_DEAD setting by index scans cannot be sure that a TID whose index tuple
-it had planned on LP_DEAD-setting has not been recycled by VACUUM if it
-drops its pin in the meantime.  It must conservatively also remember the
-LSN of the page, and only act to set LP_DEAD bits when the LSN has not
-changed at all. (Avoiding dropping the pin entirely also makes it safe, of
-course.)
+An index scan that sets LP_DEAD bits (via the kill_prior_tuple
+optimization) cannot be sure that a TID whose index tuple it had planned
+on LP_DEAD-setting has not been recycled by VACUUM if it drops its pin
+in the meantime.  It must conservatively also remember the LSN of the
+page, and only act to set LP_DEAD bits when the LSN has not changed at
+all.  Never dropping the pin in the first place makes it safe even when
+the LSN has changed (see related discussion about VACUUM's
+super-exclusive lock above), but in practice most index scans opt not
+to hold onto a pin, to avoid blocking VACUUM.
 
 Bottom-Up deletion
 ------------------
-- 
2.30.2
