Hi,

On 2026-03-31 13:39:30 -0400, Peter Geoghegan wrote:
> On Fri, Mar 27, 2026 at 8:35 PM Peter Geoghegan <[email protected]> wrote:
> > Attached is v18, which addresses the tricky issues I skipped in v17.
>
> v19 is attached. It contains a few notable improvements over v18:
>
> * The first commit (the one that simply moves code into the new
> heapam_indexscan.c file) now moves heap_hot_search_buffer over there
> as well.
>
> Andres suggested this at one point, and I believe it wins us a small
> but significant performance benefit.
To me it also just makes sense.

> From 742dcbea7d184443d2c4ef3c095b60f47f870f72 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Wed, 25 Mar 2026 00:22:17 -0400
> Subject: [PATCH v19 01/18] Move heapam_handler.c index scan code to new file.
>
> Move the heapam index fetch callbacks (index_fetch_begin,
> index_fetch_reset, index_fetch_end, and index_fetch_tuple) into a new
> dedicated file. Also move heap_hot_search_buffer over. This is a
> purely mechanical move with no functional impact.

I don't love that this renames arguments (s/call_again/heap_continue/)
while moving. Makes it harder to verify this is just the mechanical
change it claims to be.

> +/*-------------------------------------------------------------------------
> + *
> + * heapam_indexscan.c
> + *	  heap table plain index scan and index-only scan code
> + *
> + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + *	  src/backend/access/heap/heapam_indexscan.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/heapam.h"
> +#include "storage/predicate.h"

Should probably include access/relscan.h for IndexFetchTableData,
rather than relying on the indirect include.

> From 5ca838ac32a8d43510325cb400be4ecea47ea8d2 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sun, 22 Mar 2026 02:36:57 -0400
> Subject: [PATCH v19 02/18] Add slot-based table AM index scan interface.
>
> Add table_index_getnext_slot, a new table AM callback that wraps both
> plain and index-only index scans that use amgettuple. Two new
> TableAmRoutine callbacks are introduced -- one for plain scans and one
> for index-only scans -- which an upcoming commit that adds the
> amgetbatch interface will expand to four.
> The appropriate callback is
> resolved once in index_beginscan, and called through a function pointer
> (xs_getnext_slot) on the IndexScanDesc when the table_index_getnext_slot
> shim function is called from executor nodes.
>
> This moves VM checks for index-only scans out of the executor and into
> heapam, enabling batching of visibility map lookups (though for now we
> continue to just perform retail lookups).

I'd also add that it's architecturally much cleaner this way.

> A small minority of callers (2 callers in total) continue to use the old
> table_index_fetch_tuple interface. This is still necessary with callers
> that fundamentally need to pass a TID to the table AM. Both callers
> perform some kind of constraint enforcement.

I think we may need to do something about this before all of this is
over. It's just too confusing to have the generic sounding
index_fetch_tuple() interfaces that are barely ever used and *shouldn't*
be used when, uhm, fetching a tuple pointed to by an index.

At the very least it seems we should rename the ->index_fetch_tuple() to
->index_fetch_tid() or something and remove the call_again, all_dead
arguments, since they are never used in this context. I suspect it
should *not* use the table_index_fetch_begin()/table_index_fetch_end()
interface either.

> XXX Do we still need IndexFetchTableData.rel? We are now mostly using
> IndexScanDescData.heapRelation, but we could use it for absolutely
> everything if we were willing to revise the signature of the old
> table_index_fetch_tuple interface by giving it its own heapRelation arg.

I think we *should* change the argument of table_index_fetch_tuple().

> Index-only scan callers pass table_index_getnext_slot a TupleTableSlot
> (which the table AM needs internally for heap fetches), but continue to
> read their results from IndexScanDescData fields such as xs_itup (rather
> than from the slot itself).

Pretty this ain't. But I guess it's been vaguely gross like this for
quite a while, so...
> This convention lets both plain and index-only scans use the same
> table_index_getnext_slot interface.

Not convinced that's really a useful goal.

> All callers can continue to rely on the scan descriptor's xs_heaptid field
> being set on each call. (execCurrentOf is the only caller that reads this
> field directly, to determine the current row of an index-only scan, but it
> seems like a good idea to do the same thing for all callers).
>
> XXX Should execCurrentOf continue to do this? Can't it get the same TID
> from the table slot instead?

I think the tid from the slot is always the tid from the start of the
HOT chain, not the actual location of the tuple. That's important, as
otherwise HOT pruning after determining the slot's tid would potentially
make the tid invalid. Whereas, I think, xs_heaptid is currently set to
the offset of the actual tuple, even if it's just a HOT version.

> @@ -177,10 +181,13 @@ typedef struct IndexScanDescData
>  	struct TupleDescData *xs_hitupdesc; /* rowtype descriptor of xs_hitup */
>
>  	ItemPointerData xs_heaptid; /* result */
> -	bool		xs_heap_continue;	/* T if must keep walking, potential
> -									 * further results */
>  	IndexFetchTableData *xs_heapfetch;
>
> +	/* Resolved getnext_slot implementation, set by index_beginscan */
> +	bool		(*xs_getnext_slot) (struct IndexScanDescData *scan,
> +									ScanDirection direction,
> +									struct TupleTableSlot *slot);
> +
>  	bool		xs_recheck; /* T means scan keys must be rechecked */
>
>  	/*

Hm. Why is it ok that we now don't have an equivalent of
xs_heap_continue on the scan level anymore?

I see there's a local heap_continue in heapam_index_getnext_slot(), but
isn't that insufficient e.g. for a SNAPSHOT_ANY snapshot, where all the
row versions would be visible? Am I misunderstanding how this works?

> +/*
> + * Fetch the next tuple from an index scan into `slot`, scanning in the
> + * specified direction.
> + * Returns true if a tuple satisfying the scan keys and
> + * the snapshot was found, false otherwise. The tuple is stored in the
> + * specified slot.
> + *
> + * Dispatches through scan->xs_getnext_slot, which is resolved once by
> + * index_beginscan.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * released by a future table_index_getnext_slot or index_endscan call.
> + *
> + * Note: caller must check scan->xs_recheck, and perform rechecking of the
> + * scan keys if required. We do not do that here because we don't have
> + * enough information to do it efficiently in the general case.
> + *
> + * For index-only scans, the callback also fills xs_itup/xs_itupdesc or
> + * xs_hitup/xs_hitupdesc (or both) so that index data can be returned without
> + * a heap fetch.
> + */
> +static inline bool
> +table_index_getnext_slot(IndexScanDesc iscan, ScanDirection direction,
> +						 TupleTableSlot *slot)
> +{
> +	return iscan->xs_getnext_slot(iscan, direction, slot);
> +}

Hm. Does this actually belong in tableam.h? I'm a bit conflicted.

> @@ -2553,6 +2554,8 @@ static const TableAmRoutine heapam_methods = {
>  	.index_fetch_begin = heapam_index_fetch_begin,
>  	.index_fetch_reset = heapam_index_fetch_reset,
>  	.index_fetch_end = heapam_index_fetch_end,
> +	.index_plain_amgettuple_getnext_slot = heapam_index_plain_amgettuple_getnext_slot,
> +	.index_only_amgettuple_getnext_slot = heapam_index_only_amgettuple_getnext_slot,
>  	.index_fetch_tuple = heapam_index_fetch_tuple,
>
>  	.tuple_insert = heapam_tuple_insert,

Wonder if it's perhaps worth deleting _slot and shortening getnext to
next. These are quite long names...
> +/* table_index_getnext_slot callback: amgettuple, plain index scan */
> +pg_attribute_hot bool
> +heapam_index_plain_amgettuple_getnext_slot(IndexScanDesc scan,
> +										   ScanDirection direction,
> +										   TupleTableSlot *slot)
> +{
> +	Assert(!scan->xs_want_itup);
> +	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> +	return heapam_index_getnext_slot(scan, direction, slot, false);
> +}
> +
> +/* table_index_getnext_slot callback: amgettuple, index-only scan */
> +pg_attribute_hot bool
> +heapam_index_only_amgettuple_getnext_slot(IndexScanDesc scan,
> +										  ScanDirection direction,
> +										  TupleTableSlot *slot)
> +{
> +	Assert(scan->xs_want_itup);
> +	Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> +	return heapam_index_getnext_slot(scan, direction, slot, true);
> +}
> +
>  bool
>  heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>  						 ItemPointer tid,
> @@ -233,6 +274,18 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>  						 TupleTableSlot *slot,
>  						 bool *heap_continue, bool *all_dead)
>  {
> +
> +	return heapam_index_fetch_tuple_impl(scan, tid, snapshot, slot,
> +										 heap_continue, all_dead);
> +}
> +
> +static pg_attribute_always_inline bool
> +heapam_index_fetch_tuple_impl(struct IndexFetchTableData *scan,
> +							  ItemPointer tid,
> +							  Snapshot snapshot,
> +							  TupleTableSlot *slot,
> +							  bool *heap_continue, bool *all_dead)
> +{
>  	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>  	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
>  	bool		got_heap_tuple;
> @@ -289,3 +342,172 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>
>  	return got_heap_tuple;
>  }
> +
> +/*
> + * Common implementation for both heapam_index_*_getnext_slot variants.
> + *
> + * The result is true if a tuple satisfying the scan keys and the snapshot was
> + * found, false otherwise. The tuple is stored in the specified slot.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * dropped by a future call here (or by a later call to heapam_index_fetch_end
> + * through index_endscan).
> + *
> + * The index_only parameter is a compile-time constant at each call site,
> + * allowing the compiler to specialize the code for each variant.
> + */
> +static pg_attribute_always_inline bool
> +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> +						  TupleTableSlot *slot, bool index_only)

For heapam_index_fetch_tuple_impl() you added _impl, but here you
didn't. Mild preference for also adding _impl here.

> +{
> +	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> +	ItemPointer tid;
> +	bool		heap_continue = false;

As mentioned above, it's not clear to me that it is correct for all scan
types to have a short-lived heap_continue.

> +	bool		all_visible = false;
> +	BlockNumber last_visited_block = InvalidBlockNumber;
> +	uint8		n_visited_pages = 0,
> +				xs_visited_pages_limit = 0;
> +
> +	if (index_only)
> +		xs_visited_pages_limit = scan->xs_visited_pages_limit;

Did you check that it's useful to have xs_visited_pages_limit as a
longer lived variable? I suspect it's just going to add register
pressure and will need to be saved/restored across other calls, making
it just as cheap to always fetch it from scan.
> +	for (;;)
> +	{
> +		if (!heap_continue)
> +		{
> +			/* Get the next TID from the index */
> +			tid = index_getnext_tid(scan, direction);
> +
> +			/* If we're out of index entries, we're done */
> +			if (tid == NULL)
> +				break;
> +
> +			/* For index-only scans, check the visibility map */
> +			if (index_only)
> +				all_visible = VM_ALL_VISIBLE(scan->heapRelation,
> +											 ItemPointerGetBlockNumber(tid),
> +											 &hscan->xs_vmbuffer);
> +		}
> +
> +		Assert(ItemPointerIsValid(&scan->xs_heaptid));
> +
> +		if (index_only)
> +		{
> +			/*
> +			 * We can skip the heap fetch if the TID references a heap page on
> +			 * which all tuples are known visible to everybody. In any case,
> +			 * we'll use the index tuple not the heap tuple as the data
> +			 * source.
> +			 */
> +			if (!all_visible)
> +			{
> +				/*
> +				 * Rats, we have to visit the heap to check visibility.
> +				 */
> +				if (scan->instrument)
> +					scan->instrument->ntablefetches++;
> +
> +				if (!heapam_index_fetch_heap(scan, slot, &heap_continue))
> +				{

Still find it somewhat weird that we have a local 'tid' variable that we
do not pass to heapam_index_fetch_heap(), because
heapam_index_fetch_heap() relies on index_getnext_tid() having stored
the to-be-fetched tid in xs_heaptid. But I guess that's something to
change at a later date.

> +					/*
> +					 * No visible tuple. If caller set a visited-pages limit
> +					 * (only selfuncs.c does this), count distinct heap pages
> +					 * and give up once we've visited too many.
> +					 */
> +					if (unlikely(xs_visited_pages_limit > 0))
> +					{
> +						BlockNumber blk = ItemPointerGetBlockNumber(tid);
> +
> +						if (blk != last_visited_block)
> +						{
> +							last_visited_block = blk;
> +							if (++n_visited_pages > xs_visited_pages_limit)
> +								return false;	/* give up */
> +						}
> +					}
> +					continue;	/* no visible tuple, try next index entry */
> +				}
> +
> +				ExecClearTuple(slot);

Previously the ExecClearTuple() here had at least this comment:
  /* We don't actually need the heap tuple for anything */
Now it looks even more confusing...

> @@ -313,7 +313,32 @@ visibilitymap_set(BlockNumber heapBlk,
>   * since we don't lock the visibility map page either, it's even possible that
>   * someone else could have changed the bit just before we look at it, but yet
>   * we might see the old value. It is the caller's responsibility to deal with
> - * all concurrency issues!
> + * all concurrency issues! In practice it can't be stale enough to matter for
> + * the primary use case: index-only scans that check whether a heap fetch can
> + * be skipped.
> + *
> + * The argument for why it can't be stale enough to matter for the primary use
> + * case is as follows:
> + *
> + * Inserts: we need to detect that a VM bit was cleared by an insert right
> + * away, because the new tuple is present in the index but not yet visible.
> + * Reading the TID from the index page (under a shared lock on the index
> + * buffer) is serialized with the insertion of the TID into the index (under
> + * an exclusive lock on the same index buffer). Because the VM bit is cleared
> + * before the index is updated, and locking/unlocking of the index page acts
> + * as a full memory barrier, we are sure to see the cleared bit whenever we
> + * see a recently-inserted TID.
> + *
> + * Deletes: the clearing of the VM bit by a delete is NOT serialized with the
> + * index page access, because deletes do not update the index page (only
> + * VACUUM removes the index TID).
> + * So we may see a significantly stale value.
> + * However, we don't need to detect the delete right away, because the tuple
> + * remains visible until the deleting transaction commits or the statement
> + * ends (if it's our own transaction). In either case, the lock on the VM
> + * buffer will have been released (acting as a write barrier) after clearing
> + * the bit. And for us to have a snapshot that includes the deleting
> + * transaction (making the tuple invisible), we must have acquired
> + * ProcArrayLock after that time, acting as a read barrier.
>   */
>  uint8
>  visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)

Nice. Much better place for the comment.

> diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
> index 1408989c5..acc9f3e6a 100644
> --- a/src/backend/access/index/genam.c
> +++ b/src/backend/access/index/genam.c
> @@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
>  	scan->xs_hitup = NULL;
>  	scan->xs_hitupdesc = NULL;
>
> +	scan->xs_visited_pages_limit = 0;
> +
>  	return scan;
>  }

Orthogonal: I suspect eventually we should pass the table to
RelationGetIndexScan(), have the tableam return how much space it needs,
and allocate it in one chunk.

> From 85be7bf520fc2746f4ee73a0744c7aa04da55e52 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Tue, 10 Mar 2026 14:40:35 -0400
> Subject: [PATCH v19 03/18] heapam: Track heap block in IndexFetchHeapData.

LGTM.

> Subject: [PATCH v19 04/18] heapam: Keep buffer pins across index rescans.
>
> Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
> (xs_vmbuffer) during heapam_index_fetch_reset. Retaining these pins
> saves cycles during tight nested loop joins and merge joins that
> frequently restore a saved mark, since the next tuple fetched after a
> rescan often falls on the same heap page.
> It can also avoid repeated
> pinning and unpinning of the same buffer when rescans happen to revisit
> the same page.
>
> Preparation for an upcoming patch that will add the amgetbatch
> interface to enable optimizations such as I/O prefetching.
>
> Author: Peter Geoghegan <[email protected]>
> Reviewed-By: Andres Freund <[email protected]>
> Discussion: https://postgr.es/m/CAH2-Wz=g=jtsydb4utb5su2zcvss7vbp+zmvvag6abocb+s...@mail.gmail.com
> ---
>  src/backend/access/heap/heapam_indexscan.c | 27 ++++++++++++----------
>  src/backend/access/index/indexam.c         |  6 ++---
>  2 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
> index 4b5e2b30d..82f135e1e 100644
> --- a/src/backend/access/heap/heapam_indexscan.c
> +++ b/src/backend/access/heap/heapam_indexscan.c
> @@ -59,18 +59,15 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
>  {
>  	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>
> -	if (BufferIsValid(hscan->xs_cbuf))
> -	{
> -		ReleaseBuffer(hscan->xs_cbuf);
> -		hscan->xs_cbuf = InvalidBuffer;
> -		hscan->xs_blk = InvalidBlockNumber;
> -	}
> +	/* Resets are a no-op (XXX amgetbatch commit resets xs_vm_items here) */
> +	(void) hscan;

I assume that's not a line you intend to commit? LGTM otherwise.


Need to do something else for a while. More another time.

Greetings,

Andres Freund
