Hi,

On 2026-03-31 13:39:30 -0400, Peter Geoghegan wrote:
> On Fri, Mar 27, 2026 at 8:35 PM Peter Geoghegan <[email protected]> wrote:
> > Attached is v18, which addresses the tricky issues I skipped in v17.
>
> v19 is attached. It contains a few notable improvements over v18:
>
> * The first commit (the one that simply moves code into the new
> heapam_indexscan.c file) now moves heap_hot_search_buffer over there
> as well.
>
> Andres suggested this at one point, and I believe it wins us a small
> but significant performance benefit.
To me it also just makes sense.

> From 742dcbea7d184443d2c4ef3c095b60f47f870f72 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Wed, 25 Mar 2026 00:22:17 -0400
> Subject: [PATCH v19 01/18] Move heapam_handler.c index scan code to new file.
>
> Move the heapam index fetch callbacks (index_fetch_begin,
> index_fetch_reset, index_fetch_end, and index_fetch_tuple) into a new
> dedicated file. Also move heap_hot_search_buffer over. This is a
> purely mechanical move with no functional impact.

I don't love that this renames arguments (s/call_again/heap_continue/)
while moving. Makes it harder to verify this is just the mechanical
change it claims to be.

> +/*-------------------------------------------------------------------------
> + *
> + * heapam_indexscan.c
> + *	  heap table plain index scan and index-only scan code
> + *
> + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + *	  src/backend/access/heap/heapam_indexscan.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/heapam.h"
> +#include "storage/predicate.h"

Should probably include access/relscan.h for IndexFetchTableData,
rather than relying on the indirect include.

> From 5ca838ac32a8d43510325cb400be4ecea47ea8d2 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sun, 22 Mar 2026 02:36:57 -0400
> Subject: [PATCH v19 02/18] Add slot-based table AM index scan interface.
>
> Add table_index_getnext_slot, a new table AM callback that wraps both
> plain and index-only index scans that use amgettuple. Two new
> TableAmRoutine callbacks are introduced -- one for plain scans and one
> for index-only scans -- which an upcoming commit that adds the
> amgetbatch interface will expand to four.
> The appropriate callback is
> resolved once in index_beginscan, and called through a function pointer
> (xs_getnext_slot) on the IndexScanDesc when the table_index_getnext_slot
> shim function is called from executor nodes.
>
> This moves VM checks for index-only scans out of the executor and into
> heapam, enabling batching of visibility map lookups (though for now we
> continue to just perform retail lookups).

I'd also add that it's architecturally much cleaner this way.

> A small minority of callers (2 callers in total) continue to use the old
> table_index_fetch_tuple interface. This is still necessary with callers
> that fundamentally need to pass a TID to the table AM. Both callers
> perform some kind of constraint enforcement.

I think we may need to do something about this before all of this is
over. It's just too confusing to have the generic sounding
index_fetch_tuple() interfaces that are barely ever used and *shouldn't*
be used when, uhm, fetching a tuple pointed to by an index.

At the very least it seems we should rename the ->index_fetch_tuple() to
->index_fetch_tid() or something and remove the call_again, all_dead
arguments, since they are never used in this context. I suspect it
should *not* use the table_index_fetch_begin()/table_index_fetch_end()
interface either.

> XXX Do we still need IndexFetchTableData.rel? We are now mostly using
> IndexScanDescData.heapRelation, but we could use it for absolutely
> everything if we were willing to revise the signature of the old
> table_index_fetch_tuple interface by giving it its own heapRelation arg.

I think we *should* change the argument of table_index_fetch_tuple().

> Index-only scan callers pass table_index_getnext_slot a TupleTableSlot
> (which the table AM needs internally for heap fetches), but continue to
> read their results from IndexScanDescData fields such as xs_itup (rather
> than from the slot itself).

Pretty this ain't. But I guess it's been vaguely gross like this for
quite a while, so...
> This convention lets both plain and index-only scans use the same
> table_index_getnext_slot interface.

Not convinced that's really a useful goal.

> All callers can continue to rely on the scan descriptor's xs_heaptid field
> being set on each call. (execCurrentOf is the only caller that reads this
> field directly, to determine the current row of an index-only scan, but it
> seems like a good idea to do the same thing for all callers).
>
> XXX Should execCurrentOf continue to do this? Can't it get the same TID
> from the table slot instead?

I think the tid from the slot is always the tid from the start of the
HOT chain, not the actual location of the tuple. That's important, as
otherwise HOT pruning after determining the slot's tid would potentially
make the tid invalid. Whereas, I think, xs_heaptid is currently set to
the offset of the actual tuple, even if it's just a HOT version.

> @@ -177,10 +181,13 @@ typedef struct IndexScanDescData
>  	struct TupleDescData *xs_hitupdesc; /* rowtype descriptor of xs_hitup */
>
>  	ItemPointerData xs_heaptid; /* result */
> -	bool		xs_heap_continue;	/* T if must keep walking, potential
> -									 * further results */
>  	IndexFetchTableData *xs_heapfetch;
>
> +	/* Resolved getnext_slot implementation, set by index_beginscan */
> +	bool		(*xs_getnext_slot) (struct IndexScanDescData *scan,
> +									ScanDirection direction,
> +									struct TupleTableSlot *slot);
> +
>  	bool		xs_recheck; /* T means scan keys must be rechecked */
>
>  	/*

Hm. Why is it ok that we now don't have an equivalent of
xs_heap_continue on the scan level anymore?

I see there's a local heap_continue in heapam_index_getnext_slot(), but
isn't that insufficient e.g. for a SNAPSHOT_ANY snapshot, where all the
row versions would be visible? Am I misunderstanding how this works?

> +/*
> + * Fetch the next tuple from an index scan into `slot`, scanning in the
> + * specified direction.
> + * Returns true if a tuple satisfying the scan keys and
> + * the snapshot was found, false otherwise. The tuple is stored in the
> + * specified slot.
> + *
> + * Dispatches through scan->xs_getnext_slot, which is resolved once by
> + * index_beginscan.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * released by a future table_index_getnext_slot or index_endscan call.
> + *
> + * Note: caller must check scan->xs_recheck, and perform rechecking of the
> + * scan keys if required. We do not do that here because we don't have
> + * enough information to do it efficiently in the general case.
> + *
> + * For index-only scans, the callback also fills xs_itup/xs_itupdesc or
> + * xs_hitup/xs_hitupdesc (or both) so that index data can be returned without
> + * a heap fetch.
> + */
> +static inline bool
> +table_index_getnext_slot(IndexScanDesc iscan, ScanDirection direction,
> +						 TupleTableSlot *slot)
> +{
> +	return iscan->xs_getnext_slot(iscan, direction, slot);
> +}

Hm. Does this actually belong in tableam.h? I'm a bit conflicted.

> @@ -2553,6 +2554,8 @@ static const TableAmRoutine heapam_methods = {
>  	.index_fetch_begin = heapam_index_fetch_begin,
>  	.index_fetch_reset = heapam_index_fetch_reset,
>  	.index_fetch_end = heapam_index_fetch_end,
> +	.index_plain_amgettuple_getnext_slot = heapam_index_plain_amgettuple_getnext_slot,
> +	.index_only_amgettuple_getnext_slot = heapam_index_only_amgettuple_getnext_slot,
>  	.index_fetch_tuple = heapam_index_fetch_tuple,
>
>  	.tuple_insert = heapam_tuple_insert,

Wonder if it's perhaps worth deleting _slot and shortening getnext to
next. These are quite long names...
> +/* table_index_getnext_slot callback: amgettuple, plain index scan */
> +pg_attribute_hot bool
> +heapam_index_plain_amgettuple_getnext_slot(IndexScanDesc scan,
> +										   ScanDirection direction,
> +										   TupleTableSlot *slot)
> +{
> +	Assert(!scan->xs_want_itup);
> +	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> +	return heapam_index_getnext_slot(scan, direction, slot, false);
> +}
> +
> +/* table_index_getnext_slot callback: amgettuple, index-only scan */
> +pg_attribute_hot bool
> +heapam_index_only_amgettuple_getnext_slot(IndexScanDesc scan,
> +										  ScanDirection direction,
> +										  TupleTableSlot *slot)
> +{
> +	Assert(scan->xs_want_itup);
> +	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> +	return heapam_index_getnext_slot(scan, direction, slot, true);
> +}
> +
>  bool
>  heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>  						 ItemPointer tid,
> @@ -233,6 +274,18 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>  						 TupleTableSlot *slot,
>  						 bool *heap_continue, bool *all_dead)
>  {
> +
> +	return heapam_index_fetch_tuple_impl(scan, tid, snapshot, slot,
> +										 heap_continue, all_dead);
> +}
> +
> +static pg_attribute_always_inline bool
> +heapam_index_fetch_tuple_impl(struct IndexFetchTableData *scan,
> +							  ItemPointer tid,
> +							  Snapshot snapshot,
> +							  TupleTableSlot *slot,
> +							  bool *heap_continue, bool *all_dead)
> +{
>  	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>  	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
>  	bool		got_heap_tuple;
> @@ -289,3 +342,172 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>
>  	return got_heap_tuple;
>  }
> +
> +/*
> + * Common implementation for both heapam_index_*_getnext_slot variants.
> + *
> + * The result is true if a tuple satisfying the scan keys and the snapshot was
> + * found, false otherwise. The tuple is stored in the specified slot.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * dropped by a future call here (or by a later call to heapam_index_fetch_end
> + * through index_endscan).
> + *
> + * The index_only parameter is a compile-time constant at each call site,
> + * allowing the compiler to specialize the code for each variant.
> + */
> +static pg_attribute_always_inline bool
> +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> +						  TupleTableSlot *slot, bool index_only)

For heapam_index_fetch_tuple_impl() you added _impl, but here you
didn't. Mild preference for also adding _impl here.

> +{
> +	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> +	ItemPointer tid;
> +	bool		heap_continue = false;

As mentioned above, it's not clear to me that it is correct for all scan
types to have a short-lived heap_continue.

> +	bool		all_visible = false;
> +	BlockNumber last_visited_block = InvalidBlockNumber;
> +	uint8		n_visited_pages = 0,
> +				xs_visited_pages_limit = 0;
> +
> +	if (index_only)
> +		xs_visited_pages_limit = scan->xs_visited_pages_limit;

Did you check that it's useful to have xs_visited_pages_limit as a
longer lived variable? I suspect it's just going to add register
pressure and will need to be saved/restored across other calls, making
it just as cheap to always fetch it from scan.
> +	for (;;)
> +	{
> +		if (!heap_continue)
> +		{
> +			/* Get the next TID from the index */
> +			tid = index_getnext_tid(scan, direction);
> +
> +			/* If we're out of index entries, we're done */
> +			if (tid == NULL)
> +				break;
> +
> +			/* For index-only scans, check the visibility map */
> +			if (index_only)
> +				all_visible = VM_ALL_VISIBLE(scan->heapRelation,
> +											 ItemPointerGetBlockNumber(tid),
> +											 &hscan->xs_vmbuffer);
> +		}
> +
> +		Assert(ItemPointerIsValid(&scan->xs_heaptid));
> +
> +		if (index_only)
> +		{
> +			/*
> +			 * We can skip the heap fetch if the TID references a heap page on
> +			 * which all tuples are known visible to everybody. In any case,
> +			 * we'll use the index tuple not the heap tuple as the data
> +			 * source.
> +			 */
> +			if (!all_visible)
> +			{
> +				/*
> +				 * Rats, we have to visit the heap to check visibility.
> +				 */
> +				if (scan->instrument)
> +					scan->instrument->ntablefetches++;
> +
> +				if (!heapam_index_fetch_heap(scan, slot, &heap_continue))
> +				{

Still find it somewhat weird that we have a local 'tid' variable that we
do not pass to heapam_index_fetch_heap(), because
heapam_index_fetch_heap() relies on index_getnext_tid() having stored
the to-be-fetched tid in xs_heaptid. But I guess that's something to
change at a later date.

> +					/*
> +					 * No visible tuple. If caller set a visited-pages limit
> +					 * (only selfuncs.c does this), count distinct heap pages
> +					 * and give up once we've visited too many.
> +					 */
> +					if (unlikely(xs_visited_pages_limit > 0))
> +					{
> +						BlockNumber blk = ItemPointerGetBlockNumber(tid);
> +
> +						if (blk != last_visited_block)
> +						{
> +							last_visited_block = blk;
> +							if (++n_visited_pages > xs_visited_pages_limit)
> +								return false;	/* give up */
> +						}
> +					}
> +					continue;	/* no visible tuple, try next index entry */
> +				}
> +
> +				ExecClearTuple(slot);

Previously the ExecClearTuple() here had at least this comment:
  /* We don't actually need the heap tuple for anything */
Now it looks even more confusing...

> @@ -313,7 +313,32 @@ visibilitymap_set(BlockNumber heapBlk,
>   * since we don't lock the visibility map page either, it's even possible that
>   * someone else could have changed the bit just before we look at it, but yet
>   * we might see the old value. It is the caller's responsibility to deal with
> - * all concurrency issues!
> + * all concurrency issues! In practice it can't be stale enough to matter for
> + * the primary use case: index-only scans that check whether a heap fetch can
> + * be skipped.
> + *
> + * The argument for why it can't be stale enough to matter for the primary use
> + * case is as follows:
> + *
> + * Inserts: we need to detect that a VM bit was cleared by an insert right
> + * away, because the new tuple is present in the index but not yet visible.
> + * Reading the TID from the index page (under a shared lock on the index
> + * buffer) is serialized with the insertion of the TID into the index (under
> + * an exclusive lock on the same index buffer). Because the VM bit is cleared
> + * before the index is updated, and locking/unlocking of the index page acts
> + * as a full memory barrier, we are sure to see the cleared bit whenever we
> + * see a recently-inserted TID.
> + *
> + * Deletes: the clearing of the VM bit by a delete is NOT serialized with the
> + * index page access, because deletes do not update the index page (only
> + * VACUUM removes the index TID).
> + * So we may see a significantly stale value.
> + * However, we don't need to detect the delete right away, because the tuple
> + * remains visible until the deleting transaction commits or the statement
> + * ends (if it's our own transaction). In either case, the lock on the VM
> + * buffer will have been released (acting as a write barrier) after clearing
> + * the bit. And for us to have a snapshot that includes the deleting
> + * transaction (making the tuple invisible), we must have acquired
> + * ProcArrayLock after that time, acting as a read barrier.
>   */
>  uint8
>  visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)

Nice. Much better place for the comment.

> diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
> index 1408989c5..acc9f3e6a 100644
> --- a/src/backend/access/index/genam.c
> +++ b/src/backend/access/index/genam.c
> @@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
>  	scan->xs_hitup = NULL;
>  	scan->xs_hitupdesc = NULL;
>
> +	scan->xs_visited_pages_limit = 0;
> +
>  	return scan;
>  }

Orthogonal: I suspect eventually we should pass the table to
RelationGetIndexScan(), have the tableam return how much space it needs,
and allocate it in one chunk.

> From 85be7bf520fc2746f4ee73a0744c7aa04da55e52 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Tue, 10 Mar 2026 14:40:35 -0400
> Subject: [PATCH v19 03/18] heapam: Track heap block in IndexFetchHeapData.

LGTM.

> Subject: [PATCH v19 04/18] heapam: Keep buffer pins across index rescans.
>
> Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
> (xs_vmbuffer) during heapam_index_fetch_reset. Retaining these pins
> saves cycles during tight nested loop joins and merge joins that
> frequently restore a saved mark, since the next tuple fetched after a
> rescan often falls on the same heap page.
> It can also avoid repeated
> pinning and unpinning of the same buffer when rescans happen to revisit
> the same page.
>
> Preparation for an upcoming patch that will add the amgetbatch
> interface to enable optimizations such as I/O prefetching.
>
> Author: Peter Geoghegan <[email protected]>
> Reviewed-By: Andres Freund <[email protected]>
> Discussion: https://postgr.es/m/CAH2-Wz=g=jtsydb4utb5su2zcvss7vbp+zmvvag6abocb+s...@mail.gmail.com
> ---
>  src/backend/access/heap/heapam_indexscan.c | 27 ++++++++++++----------
>  src/backend/access/index/indexam.c         |  6 ++---
>  2 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
> index 4b5e2b30d..82f135e1e 100644
> --- a/src/backend/access/heap/heapam_indexscan.c
> +++ b/src/backend/access/heap/heapam_indexscan.c
> @@ -59,18 +59,15 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
>  {
>  	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>
> -	if (BufferIsValid(hscan->xs_cbuf))
> -	{
> -		ReleaseBuffer(hscan->xs_cbuf);
> -		hscan->xs_cbuf = InvalidBuffer;
> -		hscan->xs_blk = InvalidBlockNumber;
> -	}
> +	/* Resets are a no-op (XXX amgetbatch commit resets xs_vm_items here) */
> +	(void) hscan;

I assume that's not a line you intend to commit? LGTM otherwise.


Need to do something else for a while. More another time.

Greetings,

Andres Freund
