Re: EXPLAIN: showing ReadStream / prefetch stats

Andres Freund Mon, 30 Mar 2026 10:29:06 -0700

Hi,


> Subject: [PATCH v5 1/6] bufmgr: Return whether WaitReadBuffers() needed to
>  wait
>
> In a subsequent commit read_stream.c will use this as an input to the read
> ahead distance.

Note that this actually depends on a prior commit to work for io_uring...


> From 410eaaebe7b814ac9f44c080e153f4ec1d6d6b86 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <[email protected]>
> Date: Thu, 19 Mar 2026 22:25:09 +0100
> Subject: [PATCH v5 3/6] explain: show prefetch stats in EXPLAIN (ANALYZE)
>
> This adds details about AIO / prefetch for executor nodes using a
> ReadStream. Right now this applies only to BitmapHeapScan, because
> that's the only scan node using a ReadStream and collecting
> instrumentation from workers.

I don't understand why that means it should be done as part of this commit,
whereas seqscans shouldn't?


> The stats are collected by the ReadStream unconditionally, i.e. without
> EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
> integer counters), it's not worth gating this by a flag.

I think this is not true anymore.



> --- a/doc/src/sgml/ref/explain.sgml
> +++ b/doc/src/sgml/ref/explain.sgml
> @@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable 
> class="parameter">option</replaceable> [, ...] ) ] <rep
>      TIMING [ <replaceable class="parameter">boolean</replaceable> ]
>      SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
>      MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
> +    IO [ <replaceable class="parameter">boolean</replaceable> ]
>      FORMAT { TEXT | XML | JSON | YAML }
>  </synopsis>
>   </refsynopsisdiv>
> @@ -298,6 +299,17 @@ ROLLBACK;
>      </listitem>
>     </varlistentry>
>
> +   <varlistentry>
> +    <term><literal>IO</literal></term>
> +    <listitem>
> +     <para>
> +      Include information on I/O performed by each node.
> +      This parameter may only be used when <literal>ANALYZE</literal> is also
> +      enabled.  It defaults to <literal>TRUE</literal>.
> +     </para>
> +    </listitem>
> +   </varlistentry>
> +
>     <varlistentry>
>      <term><literal>FORMAT</literal></term>
>      <listitem>

I wonder if it's sane to default this to true. The amount of adjustments in
the regression tests required in the later commits makes me a bit wary. But I
guess just going around and adding a few dozen OPTION OFF is what we've lately
been doing, so perhaps it's ok.



> @@ -1198,6 +1198,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>       scan->rs_base.rs_nkeys = nkeys;
>       scan->rs_base.rs_flags = flags;
>       scan->rs_base.rs_parallel = parallel_scan;
> +     scan->rs_base.rs_instrument = NULL;
>       scan->rs_strategy = NULL;       /* set in initscan */
>       scan->rs_cbuf = InvalidBuffer;
>
> @@ -1267,6 +1268,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>
>       scan->rs_read_stream = NULL;
>
> +     /* allocate instrumentation */
> +     if (flags & SO_SCAN_INSTRUMENT)
> +             scan->rs_base.rs_instrument = 
> palloc0_object(TableScanInstrumentation);
> +     else
> +             scan->rs_base.rs_instrument = NULL;
> +
>       /*
>        * Set up a read stream for sequential scans and TID range scans. This
>        * should be done after initscan() because initscan() allocates the
> @@ -1296,7 +1303,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>                                                                               
>                                   MAIN_FORKNUM,
>                                                                               
>                                   cb,
>                                                                               
>                                   scan,
> -                                                                             
>                                   0);
> +                                                                             
>                                   0,
> +                                                                             
>                                   &scan->rs_base.rs_instrument->io);
>       }
>       else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
>       {
> @@ -1307,7 +1315,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>                                                                               
>                                   MAIN_FORKNUM,
>                                                                               
>                                   bitmapheap_stream_read_next,
>                                                                               
>                                   scan,
> -                                                                             
>                                   sizeof(TBMIterateResult));
> +                                                                             
>                                   sizeof(TBMIterateResult),
> +                                                                             
>                                   &scan->rs_base.rs_instrument->io);
>       }
>
>       scan->rs_vmbuffer = InvalidBuffer;

It doesn't seem great to dereference rs_base.rs_instrument if it's NULL, even
if that works because you then just compute the address of that.  I think
you'll end up passing the offset of io in TableScanInstrumentation this way.



> +static void
> +print_io_usage(ExplainState *es, IOStats *stats)
> +{
> +     /* don't print stats if there's nothing to report */
> +     if (stats->prefetch_count > 0)
> +     {
> +             if (es->format == EXPLAIN_FORMAT_TEXT)
> +             {
> +                     /* prefetch distance info */
> +                     ExplainIndentText(es);
> +                     appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d 
> capacity=%d",
> +                                                      (stats->distance_sum * 
> 1.0 / stats->prefetch_count),
> +                                                      stats->distance_max,
> +                                                      
> stats->distance_capacity);
> +                     appendStringInfoChar(es->str, '\n');
> +
> +                     /* prefetch I/O info (only if there were actual I/Os) */
> +                     if (stats->io_count > 0)
> +                     {
> +                             ExplainIndentText(es);
> +                             appendStringInfo(es->str, "I/O: stalls=%" 
> PRIu64,
> +                                                              
> stats->stall_count);
> +
> +                             appendStringInfo(es->str, " size=%.3f 
> inprogress=%.3f",
> +                                                              
> (stats->io_nblocks * 1.0 / stats->io_count),
> +                                                              
> (stats->io_in_progress * 1.0 / stats->io_count));
> +
> +                             appendStringInfoChar(es->str, '\n');
> +                     }

Why have these separate appendStringInfoChar's if you just could just include
the \n in the prior appendStringInfo?

Separately, is %.3f really the right thing? When is that degree of precision
useful here?



> diff --git a/src/backend/executor/nodeBitmapHeapscan.c 
> b/src/backend/executor/nodeBitmapHeapscan.c
> index 7e2c1b7467b..d181fd7b8ff 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -104,6 +104,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
>       TBMIterator tbmiterator = {0};
>       ParallelBitmapHeapState *pstate = node->pstate;
>       dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
> +     EState *estate = node->ss.ps.state;
>
>       if (!pstate)
>       {
> @@ -146,7 +147,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
>       {
>               node->ss.ss_currentScanDesc =
>                       table_beginscan_bm(node->ss.ss_currentRelation,
> -                                                        0,
> +                                                        
> (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
>                                                          
> node->ss.ps.state->es_snapshot,
>                                                          0,
>                                                          NULL);
> @@ -329,6 +330,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
>                */
>               si->exact_pages += node->stats.exact_pages;
>               si->lossy_pages += node->stats.lossy_pages;
> +
> +             /* collect prefetch info for this process from the read_stream 
> */
> +             ACCUMULATE_IO_STATS(&si->stats.io, 
> &node->ss.ss_currentScanDesc->rs_instrument->io);
>       }
>
>       /*

I see no point in having ACCUMULATE_IO_STATS be a macro instead of a static
inline?



> +/*
> + * Update stream stats with current pinned buffer depth.
> + *
> + * Called once per buffer returned to the consumer in 
> read_stream_next_buffer().
> + * Records the number of pinned buffers at that moment, so we can compute the
> + * average look-ahead depth.
> + */
> +static inline void
> +read_stream_update_stats_prefetch(ReadStream *stream)
> +{
> +     IOStats *stats = stream->stats;
> +
> +     if (stats == NULL)
> +             return;
> +
> +     stats->prefetch_count++;
> +     stats->distance_sum += stream->pinned_buffers;
> +     if (stream->pinned_buffers > stats->distance_max)
> +             stats->distance_max = stream->pinned_buffers;
> +}
> +
> +/*
> + * Update stream stats about size of I/O requests.
> + *
> + * We count the number of I/O requests, size of requests (counted in blocks)
> + * and number of in-progress I/Os.
> + */
> +static inline void
> +read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
> +{
> +     IOStats *stats = stream->stats;
> +
> +     if (stats == NULL)
> +             return;
> +
> +     stats->io_count++;
> +     stats->io_nblocks += nblocks;
> +     stats->io_in_progress += in_progress;
> +}
> +
> +static inline void
> +read_stream_update_stats_stall(ReadStream *stream)
> +{

Wonder if we should rename "stall" to "wait", seems like it might be easier to
understand for a more general audience.


>  /* ----------------------------------------------------------------
> @@ -404,9 +471,50 @@ void
>  ExecSeqScanInitializeWorker(SeqScanState *node,
>                                                       ParallelWorkerContext 
> *pwcxt)
>  {
> +     EState     *estate = node->ss.ps.state;
>       ParallelTableScanDesc pscan;
> +     char       *ptr;
> +     Size            size;
>
>       pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, 
> false);
>       node->ss.ss_currentScanDesc =
> -             table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan);
> +             table_beginscan_parallel(node->ss.ss_currentRelation,
> +                                                              
> (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
> +                                                              pscan);
> +
> +     /*
> +      * Workers don't get the pscan_len value in scan descriptor, so use the
> +      * TAM callback again. The result has to match the earlier result in
> +      * ExecSeqScanEstimate.
> +      */
> +     size = table_parallelscan_estimate(node->ss.ss_currentRelation,
> +                                                                        
> estate->es_snapshot);
> +

That seems quite grotty...


> +     ptr = (char *) pscan;
> +     ptr += MAXALIGN(size);
> +
> +     if (node->ss.ps.instrument)
> +             node->sinstrument = (SharedSeqScanInstrumentation *) ptr;

Seems like the appropriate instrumentation pointers should be computed by the
leader, not in the workers.



> diff --git a/src/test/regress/expected/explain.out 
> b/src/test/regress/expected/explain.out
> index 97cb7206676..3fa6d838545 100644
> --- a/src/test/regress/expected/explain.out
> +++ b/src/test/regress/expected/explain.out
> @@ -93,112 +93,126 @@ select explain_filter('explain (analyze, buffers, io 
> off, format text) select *
>   Execution Time: N.N ms
>  (3 rows)
>
> -select explain_filter('explain (analyze, buffers, io off, format xml) select 
> * from int8_tbl i8');
> -                     explain_filter
> ---------------------------------------------------------
> - <explain xmlns="http://www.postgresql.org/N/explain";> +
...
> + <explain xmlns="http://www.postgresql.org/N/explain";>           +
...

Seems these tests should really use unaligned output, to avoid this stupid
"every line changes despite not really anything changing" type of diff.

Greetings,

Andres Freund

Re: EXPLAIN: showing ReadStream / prefetch stats

Reply via email to