Hi,

On 2026-02-27 18:52:09 -0500, Andres Freund wrote:
> Ran out of time & energy for the moment, more later.

And now for a bit more...



> From b97e57174f45cfcde335684529b68d643f7506db Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Thu, 22 Jan 2026 13:07:13 -0500
> Subject: [PATCH v11 08/12] Use ExecSetTupleBound hint during index scans.
>
> This gives index scans a way to avoid using a read stream during certain
> kinds of queries that are very unlikely to benefit from prefetching:
> queries whose plan involves a LIMIT node that consumes tuples from an
> index scan (or index-only scan) node.
>
> Testing has shown this to be particularly important with nested loop
> joins with a LIMIT on an inner index scan.  This is typical of nested
> loop anti-joins, and nested loop semi-joins.
>
> XXX This is still very much a WIP.
>
> Author: Peter Geoghegan <[email protected]>
> Reviewed-By: Tomas Vondra <[email protected]>
> ---


> @@ -977,6 +978,49 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
>
>               ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
>       }
> +     else if (IsA(child_node, IndexScanState))
> +     {
> +             /*
> +              * If it is an IndexScan, save the tuples_needed in the state so it
> +              * can be propagated to the IndexScanDesc when the scan is started.
> +              *
> +              * Note: As with Sort, the index scan node is responsible for reacting
> +              * properly to changes to this parameter.
> +              */
> +             IndexScanState *isstate = (IndexScanState *) child_node;
> +
> +             isstate->iss_TuplesNeeded = tuples_needed;

I think this would need a comment explaining that, in contrast to e.g. the
limit in a sort node, tuples_needed here is just an approximation, not precise
(due to non-index quals).


> +             /* If scan already started, update the IndexScanDesc too */
> +             if (isstate->iss_ScanDesc)
> +                     isstate->iss_ScanDesc->tuples_needed = tuples_needed;

We really need to stop this pointless business with starting the scan late in
index and sequential scans. It's afaict completely unneeded and requires a
fair number of extra instructions in the hot ${Node}Next() functions.  But
that's a separate topic.



> +     }
> +     else if (IsA(child_node, IndexOnlyScanState))
> +     {
> +             /* Same comments as for IndexScan */
> +             IndexOnlyScanState *iosstate = (IndexOnlyScanState *) child_node;
> +
> +             iosstate->ioss_TuplesNeeded = tuples_needed;
> +
> +             /* If scan already started, update the IndexScanDesc too */
> +             if (iosstate->ioss_ScanDesc)
> +                     iosstate->ioss_ScanDesc->tuples_needed = tuples_needed;
> +     }
> +     else if (IsA(child_node, NestLoopState))
> +     {
> +             /*
> +              * For NestLoop joins where each outer tuple produces at most one
> +              * output tuple, we can propagate the bound to the outer child
> +              */
> +             NestLoopState *nlstate = (NestLoopState *) child_node;
> +             JoinType        jointype = nlstate->js.jointype;
> +
> +             if (jointype == JOIN_SEMI || jointype == JOIN_ANTI ||
> +                     nlstate->js.single_match)
> +             {
> +                     ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
> +             }
> +     }
>
>       /*
>        * In principle we could descend through any plan node type that is

I suspect this isn't quite correct as-is. Consider a semi join nestloop with a
post-join qual. In that case we might pull more tuples from the outer side than
the tuple limit allows. Which would not necessarily work if e.g. the outer side
is implemented as a top-n sort.
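To make the concern concrete, here is a toy model (standalone C, not actual
executor code; all names are made up): when a post-join qual rejects some
matches, the semi join must consume more outer rows than tuples_needed in
order to emit tuples_needed rows, so pushing the limit down as a hard bound
on the outer side would be wrong:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a semi join with a post-join qual: each outer row joins at
 * most once, but the qual may reject the joined row, forcing us to keep
 * consuming outer rows past tuples_needed.
 */
static int
outer_rows_consumed(const bool *joinqual_passes, int nouter, int tuples_needed)
{
	int			emitted = 0;
	int			consumed = 0;

	for (int i = 0; i < nouter && emitted < tuples_needed; i++)
	{
		consumed++;				/* pulled one more row from the outer side */
		if (joinqual_passes[i])	/* post-join qual on the joined row */
			emitted++;
	}
	return consumed;
}
```

With a qual that rejects half the rows, emitting 2 tuples consumes 4 outer
rows; a top-n sort bounded to 2 on the outer side could not supply them.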


Another issue is that the most potentially wasteful case for index prefetching
is, I think, actually a nestloop anti join doing a lot of unnecessary
prefetching on the *inner* side, because that can happen many times in a row.
But right now we don't use ExecSetTupleBound() for such cases...

Wonder if we should expand ExecSetTupleBound() with an additional argument
indicating whether the passed down limit is a hard limit (so it can be used by
e.g. bounded sort) or a soft limit (in which case it only would be used to
limit prefetching).
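A standalone sketch of that idea (every name here is hypothetical, not an
actual PostgreSQL API): a hard limit may change execution strategy, e.g.
switch a Sort to bounded top-N mode, while a soft limit only ever reaches
consumers that treat it as a prefetch hint:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit kinds for an extended ExecSetTupleBound() */
typedef enum { LIMIT_NONE, LIMIT_SOFT, LIMIT_HARD } LimitKind;

typedef struct SortHint
{
	bool		use_bounded;	/* top-N mode; only safe under a hard limit */
	int64_t		bound;
} SortHint;

typedef struct IndexScanHint
{
	int64_t		tuples_needed;	/* -1 means "no limit known" */
} IndexScanHint;

/* A soft limit still reaches the index scan, as a prefetch hint only... */
static void
set_index_bound(IndexScanHint *scan, int64_t tuples_needed, LimitKind kind)
{
	if (kind != LIMIT_NONE)
		scan->tuples_needed = tuples_needed;
}

/* ...but only a hard limit may switch a sort into bounded (top-N) mode. */
static void
set_sort_bound(SortHint *sort, int64_t tuples_needed, LimitKind kind)
{
	if (kind == LIMIT_HARD)
	{
		sort->use_bounded = true;
		sort->bound = tuples_needed;
	}
}
```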


If we had something like that, and used it for hinting anti-join nestloops,
merge joins, ... we could pass that knowledge down to index scans.

I suspect that most of what yielding addresses could then be addressed by a
read stream flag forcing the read stream's distance to increase much more
slowly, and perhaps also bounded by a lower maximum distance.  Something like
a READ_STREAM_SLOW_START flag at creation and/or read_stream_limit_distance()?
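Modeled as standalone C (READ_STREAM_SLOW_START and the growth policy are
assumptions for illustration, not the actual read stream code), the damped
ramp-up might look like:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of read stream lookahead distance growth */
typedef struct StreamDistance
{
	int			distance;		/* current lookahead distance */
	int			max_distance;	/* cap, e.g. from read_stream_limit_distance() */
	bool		slow_start;		/* hypothetical READ_STREAM_SLOW_START */
} StreamDistance;

static void
distance_on_hit(StreamDistance *s)
{
	if (s->slow_start)
		s->distance += 1;		/* ramp up gently */
	else
		s->distance *= 2;		/* aggressive doubling */
	if (s->distance > s->max_distance)
		s->distance = s->max_distance;
}
```

After four "keep reading" events, the default policy has already hit a
distance of 16, while the slow-start stream sits at its low cap.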



> From 60a829a7ce45dbe4da9e0f8e2fcdad862866e89a Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sun, 18 Jan 2026 11:32:52 -0500
> Subject: [PATCH v11 11/12] Add fake LSN support to hash index AM.
>
> This is preparation for an upcoming patch that will add the amgetbatch
> interface and switch hash over to it (from amgettuple).  We need fake
> LSNs to make it safe to apply behavior that is equivalent to nbtree's
> previous dropPin behavior that works with unlogged hash index scans.
>
> The commit that will add hashgetbatch makes the required changes to
> _hash_kill_items that actually relies on hash generating fake LSNs.
>
> Author: Peter Geoghegan <[email protected]>
> Discussion: https://postgr.es/m/cah2-wzkehuhxyua8quc7rrn3etnxpiksjpfo8mhb+0dr2k0...@mail.gmail.com

This made me look at some of the surrounding code again. And I suspect we
might have a bug with unlogged tables (independent of this patch, just related
due to the fake lsn infra).

FlushBuffer() is careful to not use a page's LSN unless it's
BM_PERMANENT. However, GetVictimBuffer() is not:

                        /* Read the LSN while holding buffer header lock */
                        buf_state = LockBufHdr(buf_hdr);
                        lsn = BufferGetLSN(buf_hdr);
                        UnlockBufHdr(buf_hdr);

                        if (XLogNeedsFlush(lsn)
                                && StrategyRejectBuffer(strategy, buf_hdr, 
from_ring))
                        {
                                LockBuffer(buf, BUFFER_LOCK_UNLOCK);
                                UnpinBuffer(buf_hdr);
                                goto again;
                        }

If this is an unlogged table, the call to XLogNeedsFlush() will use a fake
LSN, while XLogNeedsFlush() expects a real LSN.

Luckily the consequences seem limited: I don't think anything today uses
bulkread strategies (which are the only ones where StrategyRejectBuffer()
returns false if a flush is needed) with gist or such.  But it'd be completely
reasonable for XLogNeedsFlush() to have assertions about the LSN being in
range; we just don't have them today.
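A standalone sketch of the kind of guard I'd have in mind (modeled types and
values, not actual bufmgr code): mirror FlushBuffer() and only consult the
page LSN when the buffer is BM_PERMANENT, so a fake LSN from an unlogged
relation never reaches XLogNeedsFlush():

```c
#include <assert.h>

#define BM_PERMANENT (1U << 31)	/* stand-in for the real buf_state flag */

typedef unsigned long long XLogPtr;	/* stand-in for XLogRecPtr */

/*
 * Return the LSN to feed to XLogNeedsFlush(), or 0 (InvalidXLogRecPtr,
 * meaning "no flush needed") for non-permanent buffers whose page LSN is
 * a fake LSN rather than a real WAL position.
 */
static XLogPtr
lsn_for_flush_check(unsigned int buf_state, XLogPtr page_lsn)
{
	if ((buf_state & BM_PERMANENT) == 0)
		return 0;
	return page_lsn;
}
```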



> @@ -662,14 +670,14 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
>                * bucket buffer was not changed, but still needs to be registered to
>                * ensure that we can acquire a cleanup lock on it during replay.
>                */
> -             if (!xlrec.is_prim_bucket_same_wrt)
> +             if (!is_prim_bucket_same_wrt)
>               {
>                       uint8           flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;
>
>                       XLogRegisterBuffer(0, bucketbuf, flags);
>               }
>
> -             if (xlrec.ntups > 0)
> +             if (nitups > 0)
>               {
>                       XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
>
>
> @@ -678,10 +686,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
>
>                       XLogRegisterBufData(1, itup_offsets,
>                                                               nitups * sizeof(OffsetNumber));
> -                     for (i = 0; i < nitups; i++)
> +                     for (int i = 0; i < nitups; i++)
>                               XLogRegisterBufData(1, itups[i], tups_size[i]);
>               }
> -             else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
> +             else if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt)
>               {
>                       uint8           wbuf_flags;
>
> @@ -691,10 +699,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
>                        * if it is the same as primary bucket buffer or update the
>                        * nextblkno if it is same as the previous bucket buffer.
>                        */
> -                     Assert(xlrec.ntups == 0);
> +                     Assert(nitups == 0);
>
>                       wbuf_flags = REGBUF_STANDARD;
> -                     if (!xlrec.is_prev_bucket_same_wrt)
> +                     if (!is_prev_bucket_same_wrt)
>                       {
>                               wbuf_flags |= REGBUF_NO_CHANGE;
>                       }
> @@ -714,7 +722,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
>                * prevpage.  During replay, we can directly update the nextblock in
>                * writepage.
>                */
> -             if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
> +             if (BufferIsValid(prevbuf) && !is_prev_bucket_same_wrt)
>                       XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
>
>               if (BufferIsValid(nextbuf))


These changes afaict are unrelated to the subject of the commit?


> @@ -730,23 +738,33 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
>               }
>
>               recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
> -
> -             /* Set LSN iff wbuf is modified. */
> -             if (mod_wbuf)
> -                     PageSetLSN(BufferGetPage(wbuf), recptr);
> -
> -             PageSetLSN(BufferGetPage(ovflbuf), recptr);
> -
> -             if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
> -                     PageSetLSN(BufferGetPage(prevbuf), recptr);
> -             if (BufferIsValid(nextbuf))
> -                     PageSetLSN(BufferGetPage(nextbuf), recptr);
> -
> -             PageSetLSN(BufferGetPage(mapbuf), recptr);
> -
> -             if (update_metap)
> -                     PageSetLSN(BufferGetPage(metabuf), recptr);
>       }
> +     else                                            /* !RelationNeedsWAL(rel) */
> +     {
> +             recptr = XLogGetFakeLSN(rel);
> +
> +             /* Determine if wbuf is modified */
> +             if (nitups > 0)
> +                     mod_wbuf = true;
> +             else if (is_prev_bucket_same_wrt)
> +                     mod_wbuf = true;

Seems like this should be pulled up to before the if (RelationNeedsWAL(rel))
check, so we don't have duplicate, potentially diverging, logic?
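I.e. something like this (a simplified standalone sketch, not the actual hash
code): compute mod_wbuf once from the inputs, then use it in both branches:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Single source of truth for whether wbuf was modified, usable both by the
 * RelationNeedsWAL() path and the fake-LSN path.
 */
static bool
wbuf_is_modified(int nitups, bool is_prev_bucket_same_wrt)
{
	return nitups > 0 || is_prev_bucket_same_wrt;
}
```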


> From a99a3f0576c2c055c269e5d24a5d8075db0b66e7 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Tue, 25 Nov 2025 18:03:15 -0500
> Subject: [PATCH v11 12/12] Make hash index AM use amgetbatch interface.

> Replace hashgettuple with hashgetbatch, a function that implements the
> new amgetbatch interface.  Plain index scans of hash indexes now return
> matching items in batches consisting of all of the matches from a given
> bucket or overflow page.  This gives the core executor the ability to
> perform optimizations like index prefetching during hash index scans.
>
> Note that hash index scans will now drop index page buffer pins eagerly
> (actually, the table AM will do so on behalf of the hash index AM).

I wonder if that part should be pulled out of this commit? I think that very
well could be committed earlier.


> This is a hard requirement for any index AM that adopts the new
> amgetbatch interface.  Guaranteeing that open batches won't hold buffer
> pins on index pages greatly simplifies resource management during index
> prefetching, where the read stream is expected to hold many pins on heap
> pages (that's why amgetbatch makes this a hard requirement).
>
> Also add Valgrind buffer lock instrumentation to hash, bringing it in
> line with nbtree following commit 4a70f829.  This is another requirement
> when using the amgetbatch interface.
>
> Author: Peter Geoghegan <[email protected]>
> Reviewed-By: Tomas Vondra <[email protected]>
> Discussion: https://postgr.es/m/CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnflvt...@mail.gmail.com

Deferring reviewing this commit; I think it might change substantially with
further evolution of the relevant interfaces.


Greetings,

Andres

