Hi,
> Subject: [PATCH v5 1/6] bufmgr: Return whether WaitReadBuffers() needed to > wait > > In a subsequent commit read_stream.c will use this as an input to the read > ahead distance. Note that this actually depends on a prior commit to work for io_uring... > From 410eaaebe7b814ac9f44c080e153f4ec1d6d6b86 Mon Sep 17 00:00:00 2001 > From: Tomas Vondra <[email protected]> > Date: Thu, 19 Mar 2026 22:25:09 +0100 > Subject: [PATCH v5 3/6] explain: show prefetch stats in EXPLAIN (ANALYZE) > > This adds details about AIO / prefetch for executor nodes using a > ReadStream. Right now this applies only to BitmapHeapScan, because > that's the only scan node using a ReadStream and collecting > instrumentation from workers. I don't understand why that means it should be done as part of this commit, whereas seqscans shouldn't? > The stats are collected by the ReadStream unconditionally, i.e. without > EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of > integer counters), it's not worth gating this by a flag. I think this is not true anymore. > --- a/doc/src/sgml/ref/explain.sgml > +++ b/doc/src/sgml/ref/explain.sgml > @@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable > class="parameter">option</replaceable> [, ...] ) ] <rep > TIMING [ <replaceable class="parameter">boolean</replaceable> ] > SUMMARY [ <replaceable class="parameter">boolean</replaceable> ] > MEMORY [ <replaceable class="parameter">boolean</replaceable> ] > + IO [ <replaceable class="parameter">boolean</replaceable> ] > FORMAT { TEXT | XML | JSON | YAML } > </synopsis> > </refsynopsisdiv> > @@ -298,6 +299,17 @@ ROLLBACK; > </listitem> > </varlistentry> > > + <varlistentry> > + <term><literal>IO</literal></term> > + <listitem> > + <para> > + Include information on I/O performed by each node. > + This parameter may only be used when <literal>ANALYZE</literal> is also > + enabled. It defaults to <literal>TRUE</literal>. > + </para> > + </listitem> > + </varlistentry> > + > <varlistentry> > <term><literal>FORMAT</literal></term> > <listitem> I wonder if it's sane to default this to true. The amount of adjustments in the regression tests required in the later commits makes me a bit wary. But I guess just going around and adding a few dozen OPTION OFF is what we've lately been doing, so perhaps it's ok. > @@ -1198,6 +1198,7 @@ heap_beginscan(Relation relation, Snapshot snapshot, > scan->rs_base.rs_nkeys = nkeys; > scan->rs_base.rs_flags = flags; > scan->rs_base.rs_parallel = parallel_scan; > + scan->rs_base.rs_instrument = NULL; > scan->rs_strategy = NULL; /* set in initscan */ > scan->rs_cbuf = InvalidBuffer; > > @@ -1267,6 +1268,12 @@ heap_beginscan(Relation relation, Snapshot snapshot, > > scan->rs_read_stream = NULL; > > + /* allocate instrumentation */ > + if (flags & SO_SCAN_INSTRUMENT) > + scan->rs_base.rs_instrument = > palloc0_object(TableScanInstrumentation); > + else > + scan->rs_base.rs_instrument = NULL; > + > /* > * Set up a read stream for sequential scans and TID range scans. This > * should be done after initscan() because initscan() allocates the > @@ -1296,7 +1303,8 @@ heap_beginscan(Relation relation, Snapshot snapshot, > > MAIN_FORKNUM, > > cb, > > scan, > - > 0); > + > 0, > + > &scan->rs_base.rs_instrument->io); > } > else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN) > { > @@ -1307,7 +1315,8 @@ heap_beginscan(Relation relation, Snapshot snapshot, > > MAIN_FORKNUM, > > bitmapheap_stream_read_next, > > scan, > - > sizeof(TBMIterateResult)); > + > sizeof(TBMIterateResult), > + > &scan->rs_base.rs_instrument->io); > } > > scan->rs_vmbuffer = InvalidBuffer; It doesn't seem great to dereference rs_base.rs_instrument if it's NULL, even if that works because you then just compute the address of that. I think you'll end up passing the offset of io in TableScanInstrumentation this way. > +static void > +print_io_usage(ExplainState *es, IOStats *stats) > +{ > + /* don't print stats if there's nothing to report */ > + if (stats->prefetch_count > 0) > + { > + if (es->format == EXPLAIN_FORMAT_TEXT) > + { > + /* prefetch distance info */ > + ExplainIndentText(es); > + appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d > capacity=%d", > + (stats->distance_sum * > 1.0 / stats->prefetch_count), > + stats->distance_max, > + > stats->distance_capacity); > + appendStringInfoChar(es->str, '\n'); > + > + /* prefetch I/O info (only if there were actual I/Os) */ > + if (stats->io_count > 0) > + { > + ExplainIndentText(es); > + appendStringInfo(es->str, "I/O: stalls=%" > PRIu64, > + > stats->stall_count); > + > + appendStringInfo(es->str, " size=%.3f > inprogress=%.3f", > + > (stats->io_nblocks * 1.0 / stats->io_count), > + > (stats->io_in_progress * 1.0 / stats->io_count)); > + > + appendStringInfoChar(es->str, '\n'); > + } Why have these separate appendStringInfoChar's if you just could just include the \n in the prior appendStringInfo? Separately, is %.3f really the right thing? When is that degree of precision useful here? > diff --git a/src/backend/executor/nodeBitmapHeapscan.c > b/src/backend/executor/nodeBitmapHeapscan.c > index 7e2c1b7467b..d181fd7b8ff 100644 > --- a/src/backend/executor/nodeBitmapHeapscan.c > +++ b/src/backend/executor/nodeBitmapHeapscan.c > @@ -104,6 +104,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node) > TBMIterator tbmiterator = {0}; > ParallelBitmapHeapState *pstate = node->pstate; > dsa_area *dsa = node->ss.ps.state->es_query_dsa; > + EState *estate = node->ss.ps.state; > > if (!pstate) > { > @@ -146,7 +147,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node) > { > node->ss.ss_currentScanDesc = > table_beginscan_bm(node->ss.ss_currentRelation, > - 0, > + > (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0, > > node->ss.ps.state->es_snapshot, > 0, > NULL); > @@ -329,6 +330,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node) > */ > si->exact_pages += node->stats.exact_pages; > si->lossy_pages += node->stats.lossy_pages; > + > + /* collect prefetch info for this process from the read_stream > */ > + ACCUMULATE_IO_STATS(&si->stats.io, > &node->ss.ss_currentScanDesc->rs_instrument->io); > } > > /* I see no point in having ACCUMULATE_IO_STATS be a macro instead of a static inline? > +/* > + * Update stream stats with current pinned buffer depth. > + * > + * Called once per buffer returned to the consumer in > read_stream_next_buffer(). > + * Records the number of pinned buffers at that moment, so we can compute the > + * average look-ahead depth. > + */ > +static inline void > +read_stream_update_stats_prefetch(ReadStream *stream) > +{ > + IOStats *stats = stream->stats; > + > + if (stats == NULL) > + return; > + > + stats->prefetch_count++; > + stats->distance_sum += stream->pinned_buffers; > + if (stream->pinned_buffers > stats->distance_max) > + stats->distance_max = stream->pinned_buffers; > +} > + > +/* > + * Update stream stats about size of I/O requests. > + * > + * We count the number of I/O requests, size of requests (counted in blocks) > + * and number of in-progress I/Os. > + */ > +static inline void > +read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress) > +{ > + IOStats *stats = stream->stats; > + > + if (stats == NULL) > + return; > + > + stats->io_count++; > + stats->io_nblocks += nblocks; > + stats->io_in_progress += in_progress; > +} > + > +static inline void > +read_stream_update_stats_stall(ReadStream *stream) > +{ Wonder if we should rename "stall" to "wait", seems like it might be easier to understand for a more general audience. > /* ---------------------------------------------------------------- > @@ -404,9 +471,50 @@ void > ExecSeqScanInitializeWorker(SeqScanState *node, > ParallelWorkerContext > *pwcxt) > { > + EState *estate = node->ss.ps.state; > ParallelTableScanDesc pscan; > + char *ptr; > + Size size; > > pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, > false); > node->ss.ss_currentScanDesc = > - table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan); > + table_beginscan_parallel(node->ss.ss_currentRelation, > + > (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0, > + pscan); > + > + /* > + * Workers don't get the pscan_len value in scan descriptor, so use the > + * TAM callback again. The result has to match the earlier result in > + * ExecSeqScanEstimate. > + */ > + size = table_parallelscan_estimate(node->ss.ss_currentRelation, > + > estate->es_snapshot); > + That seems quite grotty... > + ptr = (char *) pscan; > + ptr += MAXALIGN(size); > + > + if (node->ss.ps.instrument) > + node->sinstrument = (SharedSeqScanInstrumentation *) ptr; Seems like the appropriate instrumentation pointers should be computed by the leader, not in the workers. > diff --git a/src/test/regress/expected/explain.out > b/src/test/regress/expected/explain.out > index 97cb7206676..3fa6d838545 100644 > --- a/src/test/regress/expected/explain.out > +++ b/src/test/regress/expected/explain.out > @@ -93,112 +93,126 @@ select explain_filter('explain (analyze, buffers, io > off, format text) select * > Execution Time: N.N ms > (3 rows) > > -select explain_filter('explain (analyze, buffers, io off, format xml) select > * from int8_tbl i8'); > - explain_filter > --------------------------------------------------------- > - <explain xmlns="http://www.postgresql.org/N/explain"> + ... > + <explain xmlns="http://www.postgresql.org/N/explain"> + ... Seems these tests should really use unaligned output, to avoid this stupid "every line changes despite not really anything changing" type of diff. Greetings, Andres Freund
