On 4/7/26 02:19, Melanie Plageman wrote:
> On Mon, Apr 6, 2026 at 4:39 PM Tomas Vondra <[email protected]> wrote:
>>
>> On 4/6/26 18:50, Melanie Plageman wrote:
>>
>>> I cleaned up the first patch in the set that refactors index-only and
>>> index scan to use this pattern and realized that I wasn't sure what to
>>> do about the duplication between index-only and index scans for these
>>> functions.
>>
> <--snip-->
>>
>> I think this is ready to go, with a tiny amount of polishing.
>>
>> 1) "amount of DSM" sounds a bit strange to me. The wording "amount of
>> space in ..." from the other nodes seems better to me. Or maybe I'm just
>> used to it, not sure.
> 
> Fixed that before pushing.
> 
>> 2) I wonder if maybe PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET should be
>> placed in plannodes.h, because that's where Plan->plan_node_id is
>> defined. instrument_node.h works too, but the places accessing the
>> plan_node_id already have to include plannodes.h.
> 
> I don't think it really belongs there. It is very specific to
> execution. We use the plan node id as the key for convenience, but it
> isn't the purpose of plan_node_id.
> PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET's only purpose is to be used
> during execution for instrumentation, so it feels like it belongs in
> instrument_node.h. And we don't need a separate include for it either.
> It goes with the other stuff being defined in instrument_node.h (like
> SharedIndexScanInstrumentation) and being used by those callers. I
> admit the comment above it is a bit odd, but I think it is ultimately
> okay.
> 

WFM
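
For anyone following along, the keying scheme boils down to "plan node id
plus a fixed offset", so the instrumentation chunk never collides with the
node's regular parallel state. A standalone sketch of that idea (not the
executor code; the toy TOC, the names and the offset value are all made up
for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up offset; the real PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET may differ. */
#define SKETCH_INSTRUMENT_KEY_OFFSET	UINT64_C(0xD000000000000000)
#define SKETCH_TOC_MAX					8

/* Toy stand-in for a shared-memory table of contents: key -> address. */
typedef struct SketchTocEntry
{
	uint64_t	key;
	void	   *addr;
} SketchTocEntry;

typedef struct SketchToc
{
	int			nentries;
	SketchTocEntry entries[SKETCH_TOC_MAX];
} SketchToc;

/* Register a chunk under a key; returns -1 if the toy table is full. */
static int
sketch_toc_insert(SketchToc *toc, uint64_t key, void *addr)
{
	assert(toc != NULL);
	if (toc->nentries >= SKETCH_TOC_MAX)
		return -1;
	toc->entries[toc->nentries].key = key;
	toc->entries[toc->nentries].addr = addr;
	toc->nentries++;
	return 0;
}

/* Find the chunk registered under a key, or NULL if there is none. */
static void *
sketch_toc_lookup(const SketchToc *toc, uint64_t key)
{
	assert(toc != NULL);
	for (int i = 0; i < toc->nentries; i++)
	{
		if (toc->entries[i].key == key)
			return toc->entries[i].addr;
	}
	return NULL;
}

/* Key for a node's instrumentation chunk, distinct from the node's own key. */
static uint64_t
sketch_instrument_key(int plan_node_id)
{
	return (uint64_t) plan_node_id + SKETCH_INSTRUMENT_KEY_OFFSET;
}
```

The leader inserts under both keys and each worker looks up whichever
chunks it needs, so a node that is not parallel-aware can still find its
instrumentation.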

>> We need to decide whether to push this into PG19. This was primarily
>> motivated by the index prefetching work, but we now know that won't
>> happen in PG19 :-( But the instrumentation is useful even for the three
>> scans using read streams, so I think it's a meaningful improvement.
> 
> I think it is a meaningful improvement too. I think we should do it.

OK

>> If you think you can get this pushed, I'll do my best to finalize the
>> instrumentation and SeqScan/TidRangeScan parts.
> 
> I've pushed the first patch for index/index-only scans. Attached is
> the BHS fix that uses the new pattern. It needs at least one review
> before pushing. While I was polishing it, I realized I neglected to
> use add_size()/mul_size() in the index-only/index scan patches. So,
> 0002 is just a fix commit to do that. Feel free to push these if you
> think they're ready. Otherwise, I'll do so pending your review in my
> morning.
> 

Thanks!

I'll take a look in my morning, and will consider pushing the changes so
that I can start pushing the parts adding the new instrumentation.
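
Regarding add_size()/mul_size(): the point is only that the size of the
shared chunk (header plus nworkers entries) must not silently wrap
around. A standalone sketch of the same kind of check (not the PostgreSQL
functions, which raise an error instead of returning a flag; all names
here are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Add two sizes; returns false instead of wrapping around on overflow. */
static bool
sketch_add_size(size_t a, size_t b, size_t *result)
{
	assert(result != NULL);
	if (a > SIZE_MAX - b)
		return false;
	*result = a + b;
	return true;
}

/* Multiply two sizes; returns false instead of wrapping around on overflow. */
static bool
sketch_mul_size(size_t a, size_t b, size_t *result)
{
	assert(result != NULL);
	if (b != 0 && a > SIZE_MAX / b)
		return false;
	*result = a * b;
	return true;
}

/*
 * Size of a shared instrumentation chunk: a fixed header followed by one
 * entry per worker.
 */
static bool
sketch_shared_instr_size(size_t header, size_t nworkers, size_t entry,
						 size_t *result)
{
	size_t		array_size;

	if (!sketch_mul_size(nworkers, entry, &array_size))
		return false;
	return sketch_add_size(header, array_size, result);
}
```

With the plain "offsetof(...) + nworkers * sizeof(...)" form an overflow
would go unnoticed and we'd allocate a chunk that is too small.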


I've spent a fair amount of time looking at those parts today. Attached
is v12 (including the two parts you posted as v11), with three bigger
changes compared to earlier versions:

1) default OFF

After thinking about it a bit more, I decided to change the default to
OFF. On the one hand I agree it's somewhat similar to BUFFERS, and that
option is ON by default now. But on the other hand, we should not clutter
EXPLAIN output with too much information, and I'm not convinced the IO
details are worth showing by default. So I changed it to OFF by default.
That also makes the patches smaller, due to not having to adjust that
many tests.

2) auto_explain

It also occurred to me we should add a matching option to auto_explain,
similarly to log_buffers. 0005 does that.

3) INSTRUMENT_IO

The auto_explain bit also implies we need a new INSTRUMENT_IO constant,
to handle this just like BUFFERS. Which also means we can actually
collect the stats only when the IO option is specified (instead of for
all ANALYZE runs). Which is nice.
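
To illustrate what I mean, here is a standalone sketch (not the patch
itself; the struct, the function names and the flag values are made up for
the example). Each option sets its own bit, and the scan collects the I/O
stats only when the IO bit was requested:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
#define SKETCH_INSTRUMENT_BUFFERS	(1 << 1)
#define SKETCH_INSTRUMENT_WAL		(1 << 3)
#define SKETCH_INSTRUMENT_IO		(1 << 4)

/* Made-up stand-in for the relevant EXPLAIN / auto_explain options. */
typedef struct SketchExplainOpts
{
	bool		buffers;
	bool		wal;
	bool		io;
} SketchExplainOpts;

/* Translate the options into a bitmask; the options are independent. */
static int
sketch_instrument_options(const SketchExplainOpts *opts)
{
	int			instrument_option = 0;

	assert(opts != NULL);
	if (opts->buffers)
		instrument_option |= SKETCH_INSTRUMENT_BUFFERS;
	if (opts->wal)
		instrument_option |= SKETCH_INSTRUMENT_WAL;
	if (opts->io)
		instrument_option |= SKETCH_INSTRUMENT_IO;

	return instrument_option;
}

/* The scan enables I/O stats only if the IO bit is set. */
static bool
sketch_collect_io_stats(int instrument_option)
{
	return (instrument_option & SKETCH_INSTRUMENT_IO) != 0;
}
```

So a plain EXPLAIN ANALYZE without the IO option pays nothing for the
extra counters.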


I plan to do some more testing in my morning, add missing comments,
polish the commit messages etc. And then eventually push it.

regards

-- 
Tomas Vondra
From e7b4396be97e49a9c1791ce12de268229575bd83 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Sun, 5 Apr 2026 16:27:22 -0400
Subject: [PATCH v12 1/7] Fix BitmapHeapScan non-parallel-aware EXPLAIN ANALYZE

Allocate shared bitmap table scan instrumentation in its own DSM chunk,
separate from ParallelBitmapHeapState. Previously, shared
instrumentation was only allocated for parallel-aware nodes, so a
non-parallel-aware bitmap table scan in a parallel query had no shared
instrumentation and EXPLAIN ANALYZE didn't report exact/lossy pages.
This affected queries like bitmap table scans on the outside of a
parallel join or those run with debug_parallel_query=regress. Fix by
allocating a separate DSM chunk for shared instrumentation and doing so
regardless of parallel-awareness.
---
 src/backend/commands/explain.c            |   2 +-
 src/backend/executor/execParallel.c       |   9 ++
 src/backend/executor/nodeBitmapHeapscan.c | 111 +++++++++++++---------
 src/include/executor/nodeBitmapHeapscan.h |   6 ++
 4 files changed, 83 insertions(+), 45 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73eaaf176ac..f151f21f9b3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3948,7 +3948,7 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 	}
 
 	/* Display stats for each parallel worker */
-	if (planstate->pstate != NULL)
+	if (planstate->sinstrument != NULL)
 	{
 		for (int n = 0; n < planstate->sinstrument->num_workers; n++)
 		{
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 726aca230a4..1a5ec0c305f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -303,6 +303,9 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			if (planstate->plan->parallel_aware)
 				ExecBitmapHeapEstimate((BitmapHeapScanState *) planstate,
 									   e->pcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecBitmapHeapInstrumentEstimate((BitmapHeapScanState *) planstate,
+											 e->pcxt);
 			break;
 		case T_HashJoinState:
 			if (planstate->plan->parallel_aware)
@@ -542,6 +545,9 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			if (planstate->plan->parallel_aware)
 				ExecBitmapHeapInitializeDSM((BitmapHeapScanState *) planstate,
 											d->pcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecBitmapHeapInstrumentInitDSM((BitmapHeapScanState *) planstate,
+											d->pcxt);
 			break;
 		case T_HashJoinState:
 			if (planstate->plan->parallel_aware)
@@ -1427,6 +1433,9 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			if (planstate->plan->parallel_aware)
 				ExecBitmapHeapInitializeWorker((BitmapHeapScanState *) planstate,
 											   pwcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecBitmapHeapInstrumentInitWorker((BitmapHeapScanState *) planstate,
+											   pwcxt);
 			break;
 		case T_HashJoinState:
 			if (planstate->plan->parallel_aware)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 73831aed451..d65e2a87b42 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -495,18 +495,8 @@ void
 ExecBitmapHeapEstimate(BitmapHeapScanState *node,
 					   ParallelContext *pcxt)
 {
-	Size		size;
-
-	size = MAXALIGN(sizeof(ParallelBitmapHeapState));
-
-	/* account for instrumentation, if required */
-	if (node->ss.ps.instrument && pcxt->nworkers > 0)
-	{
-		size = add_size(size, offsetof(SharedBitmapHeapInstrumentation, sinstrument));
-		size = add_size(size, mul_size(pcxt->nworkers, sizeof(BitmapHeapScanInstrumentation)));
-	}
-
-	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   MAXALIGN(sizeof(ParallelBitmapHeapState)));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 }
 
@@ -521,27 +511,15 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 							ParallelContext *pcxt)
 {
 	ParallelBitmapHeapState *pstate;
-	SharedBitmapHeapInstrumentation *sinstrument = NULL;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
-	char	   *ptr;
-	Size		size;
 
 	/* If there's no DSA, there are no workers; initialize nothing. */
 	if (dsa == NULL)
 		return;
 
-	size = MAXALIGN(sizeof(ParallelBitmapHeapState));
-	if (node->ss.ps.instrument && pcxt->nworkers > 0)
-	{
-		size = add_size(size, offsetof(SharedBitmapHeapInstrumentation, sinstrument));
-		size = add_size(size, mul_size(pcxt->nworkers, sizeof(BitmapHeapScanInstrumentation)));
-	}
-
-	ptr = shm_toc_allocate(pcxt->toc, size);
-	pstate = (ParallelBitmapHeapState *) ptr;
-	ptr += MAXALIGN(sizeof(ParallelBitmapHeapState));
-	if (node->ss.ps.instrument && pcxt->nworkers > 0)
-		sinstrument = (SharedBitmapHeapInstrumentation *) ptr;
+	pstate = (ParallelBitmapHeapState *)
+		shm_toc_allocate(pcxt->toc,
+						 MAXALIGN(sizeof(ParallelBitmapHeapState)));
 
 	pstate->tbmiterator = 0;
 
@@ -551,18 +529,8 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 
 	ConditionVariableInit(&pstate->cv);
 
-	if (sinstrument)
-	{
-		sinstrument->num_workers = pcxt->nworkers;
-
-		/* ensure any unfilled slots will contain zeroes */
-		memset(sinstrument->sinstrument, 0,
-			   pcxt->nworkers * sizeof(BitmapHeapScanInstrumentation));
-	}
-
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pstate);
 	node->pstate = pstate;
-	node->sinstrument = sinstrument;
 }
 
 /* ----------------------------------------------------------------
@@ -600,17 +568,72 @@ void
 ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 							   ParallelWorkerContext *pwcxt)
 {
-	char	   *ptr;
-
 	Assert(node->ss.ps.state->es_query_dsa != NULL);
 
-	ptr = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+	node->pstate = (ParallelBitmapHeapState *)
+		shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+}
+
+/*
+ * Compute the amount of space we'll need for the shared instrumentation and
+ * inform pcxt->estimator.
+ */
+void
+ExecBitmapHeapInstrumentEstimate(BitmapHeapScanState *node,
+								 ParallelContext *pcxt)
+{
+	Size		size;
+
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = add_size(offsetof(SharedBitmapHeapInstrumentation, sinstrument),
+					mul_size(pcxt->nworkers, sizeof(BitmapHeapScanInstrumentation)));
+	shm_toc_estimate_chunk(&pcxt->estimator, size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * Set up parallel bitmap heap scan instrumentation.
+ */
+void
+ExecBitmapHeapInstrumentInitDSM(BitmapHeapScanState *node,
+								ParallelContext *pcxt)
+{
+	Size		size;
+
+	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+		return;
+
+	size = add_size(offsetof(SharedBitmapHeapInstrumentation, sinstrument),
+					mul_size(pcxt->nworkers, sizeof(BitmapHeapScanInstrumentation)));
+	node->sinstrument =
+		(SharedBitmapHeapInstrumentation *) shm_toc_allocate(pcxt->toc, size);
+
+	/* Each per-worker area must start out as zeroes */
+	memset(node->sinstrument, 0, size);
+	node->sinstrument->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc,
+				   node->ss.ps.plan->plan_node_id +
+				   PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET,
+				   node->sinstrument);
+}
 
-	node->pstate = (ParallelBitmapHeapState *) ptr;
-	ptr += MAXALIGN(sizeof(ParallelBitmapHeapState));
+/*
+ * Look up and save the location of the shared instrumentation.
+ */
+void
+ExecBitmapHeapInstrumentInitWorker(BitmapHeapScanState *node,
+								   ParallelWorkerContext *pwcxt)
+{
+	if (!node->ss.ps.instrument)
+		return;
 
-	if (node->ss.ps.instrument)
-		node->sinstrument = (SharedBitmapHeapInstrumentation *) ptr;
+	node->sinstrument = (SharedBitmapHeapInstrumentation *)
+		shm_toc_lookup(pwcxt->toc,
+					   node->ss.ps.plan->plan_node_id +
+					   PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET,
+					   false);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 5d82f71abff..5c0b6592a1f 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,6 +28,12 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void ExecBitmapHeapInstrumentEstimate(BitmapHeapScanState *node,
+											 ParallelContext *pcxt);
+extern void ExecBitmapHeapInstrumentInitDSM(BitmapHeapScanState *node,
+											ParallelContext *pcxt);
+extern void ExecBitmapHeapInstrumentInitWorker(BitmapHeapScanState *node,
+											   ParallelWorkerContext *pwcxt);
 extern void ExecBitmapHeapRetrieveInstrumentation(BitmapHeapScanState *node);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
-- 
2.53.0

From eb1b6b6f483d6ca77e869fd81cecd246cca3bf53 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 6 Apr 2026 20:06:49 -0400
Subject: [PATCH v12 2/7] Use add_size/mul_size for index instrumentation size
 calculations

Use overflow-safe size arithmetic in the Index[Only]Scan parallel
instrumentation functions, consistent with other executor nodes (Hash,
Sort, Agg, Memoize). This was an oversight in dd78e69cfc3.
---
 src/backend/executor/nodeIndexonlyscan.c | 8 ++++----
 src/backend/executor/nodeIndexscan.c     | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c4ff422756e..d52012e8a69 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -854,8 +854,8 @@ ExecIndexOnlyScanInstrumentEstimate(IndexOnlyScanState *node,
 	 * in the IndexOnlyScanState. We'll recalculate the needed size in
 	 * ExecIndexOnlyScanInstrumentInitDSM().
 	 */
-	size = offsetof(SharedIndexScanInstrumentation, winstrument) +
-		pcxt->nworkers * sizeof(IndexScanInstrumentation);
+	size = add_size(offsetof(SharedIndexScanInstrumentation, winstrument),
+					mul_size(pcxt->nworkers, sizeof(IndexScanInstrumentation)));
 	shm_toc_estimate_chunk(&pcxt->estimator, size);
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 }
@@ -872,8 +872,8 @@ ExecIndexOnlyScanInstrumentInitDSM(IndexOnlyScanState *node,
 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
 		return;
 
-	size = offsetof(SharedIndexScanInstrumentation, winstrument) +
-		pcxt->nworkers * sizeof(IndexScanInstrumentation);
+	size = add_size(offsetof(SharedIndexScanInstrumentation, winstrument),
+					mul_size(pcxt->nworkers, sizeof(IndexScanInstrumentation)));
 	node->ioss_SharedInfo =
 		(SharedIndexScanInstrumentation *) shm_toc_allocate(pcxt->toc, size);
 
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 6cc927f2454..39f6691ee35 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1789,8 +1789,8 @@ ExecIndexScanInstrumentEstimate(IndexScanState *node,
 	 * in the IndexScanState. We'll recalculate the needed size in
 	 * ExecIndexScanInstrumentInitDSM().
 	 */
-	size = offsetof(SharedIndexScanInstrumentation, winstrument) +
-		pcxt->nworkers * sizeof(IndexScanInstrumentation);
+	size = add_size(offsetof(SharedIndexScanInstrumentation, winstrument),
+					mul_size(pcxt->nworkers, sizeof(IndexScanInstrumentation)));
 	shm_toc_estimate_chunk(&pcxt->estimator, size);
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 }
@@ -1807,8 +1807,8 @@ ExecIndexScanInstrumentInitDSM(IndexScanState *node,
 	if (!node->ss.ps.instrument || pcxt->nworkers == 0)
 		return;
 
-	size = offsetof(SharedIndexScanInstrumentation, winstrument) +
-		pcxt->nworkers * sizeof(IndexScanInstrumentation);
+	size = add_size(offsetof(SharedIndexScanInstrumentation, winstrument),
+					mul_size(pcxt->nworkers, sizeof(IndexScanInstrumentation)));
 	node->iss_SharedInfo =
 		(SharedIndexScanInstrumentation *) shm_toc_allocate(pcxt->toc, size);
 
-- 
2.53.0

From b8561cc9b76282a0199a501c5b82daf093c91987 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Tue, 31 Mar 2026 13:44:23 +0200
Subject: [PATCH v12 3/7] switch explain to unaligned for json/xml/yaml

---
 src/test/regress/expected/explain.out | 296 +++++++++++++-------------
 src/test/regress/sql/explain.sql      |   5 +-
 2 files changed, 150 insertions(+), 151 deletions(-)

diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..dc31c7ce9f9 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -93,164 +93,160 @@ select explain_filter('explain (analyze, buffers, format text) select * from int
  Execution Time: N.N ms
 (3 rows)
 
-select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
-                     explain_filter                     
---------------------------------------------------------
- <explain xmlns="http://www.postgresql.org/N/explain"> +
-   <Query>                                             +
-     <Plan>                                            +
-       <Node-Type>Seq Scan</Node-Type>                 +
-       <Parallel-Aware>false</Parallel-Aware>          +
-       <Async-Capable>false</Async-Capable>            +
-       <Relation-Name>int8_tbl</Relation-Name>         +
-       <Alias>i8</Alias>                               +
-       <Startup-Cost>N.N</Startup-Cost>                +
-       <Total-Cost>N.N</Total-Cost>                    +
-       <Plan-Rows>N</Plan-Rows>                        +
-       <Plan-Width>N</Plan-Width>                      +
-       <Actual-Startup-Time>N.N</Actual-Startup-Time>  +
-       <Actual-Total-Time>N.N</Actual-Total-Time>      +
-       <Actual-Rows>N.N</Actual-Rows>                  +
-       <Actual-Loops>N</Actual-Loops>                  +
-       <Disabled>false</Disabled>                      +
-       <Shared-Hit-Blocks>N</Shared-Hit-Blocks>        +
-       <Shared-Read-Blocks>N</Shared-Read-Blocks>      +
-       <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
-       <Shared-Written-Blocks>N</Shared-Written-Blocks>+
-       <Local-Hit-Blocks>N</Local-Hit-Blocks>          +
-       <Local-Read-Blocks>N</Local-Read-Blocks>        +
-       <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks>  +
-       <Local-Written-Blocks>N</Local-Written-Blocks>  +
-       <Temp-Read-Blocks>N</Temp-Read-Blocks>          +
-       <Temp-Written-Blocks>N</Temp-Written-Blocks>    +
-     </Plan>                                           +
-     <Planning>                                        +
-       <Shared-Hit-Blocks>N</Shared-Hit-Blocks>        +
-       <Shared-Read-Blocks>N</Shared-Read-Blocks>      +
-       <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
-       <Shared-Written-Blocks>N</Shared-Written-Blocks>+
-       <Local-Hit-Blocks>N</Local-Hit-Blocks>          +
-       <Local-Read-Blocks>N</Local-Read-Blocks>        +
-       <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks>  +
-       <Local-Written-Blocks>N</Local-Written-Blocks>  +
-       <Temp-Read-Blocks>N</Temp-Read-Blocks>          +
-       <Temp-Written-Blocks>N</Temp-Written-Blocks>    +
-     </Planning>                                       +
-     <Planning-Time>N.N</Planning-Time>                +
-     <Triggers>                                        +
-     </Triggers>                                       +
-     <Execution-Time>N.N</Execution-Time>              +
-   </Query>                                            +
- </explain>
-(1 row)
-
-select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
-        explain_filter         
--------------------------------
- - Plan:                      +
-     Node Type: "Seq Scan"    +
-     Parallel Aware: false    +
-     Async Capable: false     +
-     Relation Name: "int8_tbl"+
-     Alias: "i8"              +
-     Startup Cost: N.N        +
-     Total Cost: N.N          +
-     Plan Rows: N             +
-     Plan Width: N            +
-     Actual Startup Time: N.N +
-     Actual Total Time: N.N   +
-     Actual Rows: N.N         +
-     Actual Loops: N          +
-     Disabled: false          +
-     Shared Hit Blocks: N     +
-     Shared Read Blocks: N    +
-     Shared Dirtied Blocks: N +
-     Shared Written Blocks: N +
-     Local Hit Blocks: N      +
-     Local Read Blocks: N     +
-     Local Dirtied Blocks: N  +
-     Local Written Blocks: N  +
-     Temp Read Blocks: N      +
-     Temp Written Blocks: N   +
-   Planning:                  +
-     Shared Hit Blocks: N     +
-     Shared Read Blocks: N    +
-     Shared Dirtied Blocks: N +
-     Shared Written Blocks: N +
-     Local Hit Blocks: N      +
-     Local Read Blocks: N     +
-     Local Dirtied Blocks: N  +
-     Local Written Blocks: N  +
-     Temp Read Blocks: N      +
-     Temp Written Blocks: N   +
-   Planning Time: N.N         +
-   Triggers:                  +
-   Serialization:             +
-     Time: N.N                +
-     Output Volume: N         +
-     Format: "text"           +
-     Shared Hit Blocks: N     +
-     Shared Read Blocks: N    +
-     Shared Dirtied Blocks: N +
-     Shared Written Blocks: N +
-     Local Hit Blocks: N      +
-     Local Read Blocks: N     +
-     Local Dirtied Blocks: N  +
-     Local Written Blocks: N  +
-     Temp Read Blocks: N      +
-     Temp Written Blocks: N   +
-   Execution Time: N.N
-(1 row)
-
 select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
                      explain_filter                      
 ---------------------------------------------------------
  Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
 (1 row)
 
+\a
+select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
+explain_filter
+<explain xmlns="http://www.postgresql.org/N/explain">
+  <Query>
+    <Plan>
+      <Node-Type>Seq Scan</Node-Type>
+      <Parallel-Aware>false</Parallel-Aware>
+      <Async-Capable>false</Async-Capable>
+      <Relation-Name>int8_tbl</Relation-Name>
+      <Alias>i8</Alias>
+      <Startup-Cost>N.N</Startup-Cost>
+      <Total-Cost>N.N</Total-Cost>
+      <Plan-Rows>N</Plan-Rows>
+      <Plan-Width>N</Plan-Width>
+      <Actual-Startup-Time>N.N</Actual-Startup-Time>
+      <Actual-Total-Time>N.N</Actual-Total-Time>
+      <Actual-Rows>N.N</Actual-Rows>
+      <Actual-Loops>N</Actual-Loops>
+      <Disabled>false</Disabled>
+      <Shared-Hit-Blocks>N</Shared-Hit-Blocks>
+      <Shared-Read-Blocks>N</Shared-Read-Blocks>
+      <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>
+      <Shared-Written-Blocks>N</Shared-Written-Blocks>
+      <Local-Hit-Blocks>N</Local-Hit-Blocks>
+      <Local-Read-Blocks>N</Local-Read-Blocks>
+      <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks>
+      <Local-Written-Blocks>N</Local-Written-Blocks>
+      <Temp-Read-Blocks>N</Temp-Read-Blocks>
+      <Temp-Written-Blocks>N</Temp-Written-Blocks>
+    </Plan>
+    <Planning>
+      <Shared-Hit-Blocks>N</Shared-Hit-Blocks>
+      <Shared-Read-Blocks>N</Shared-Read-Blocks>
+      <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>
+      <Shared-Written-Blocks>N</Shared-Written-Blocks>
+      <Local-Hit-Blocks>N</Local-Hit-Blocks>
+      <Local-Read-Blocks>N</Local-Read-Blocks>
+      <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks>
+      <Local-Written-Blocks>N</Local-Written-Blocks>
+      <Temp-Read-Blocks>N</Temp-Read-Blocks>
+      <Temp-Written-Blocks>N</Temp-Written-Blocks>
+    </Planning>
+    <Planning-Time>N.N</Planning-Time>
+    <Triggers>
+    </Triggers>
+    <Execution-Time>N.N</Execution-Time>
+  </Query>
+</explain>
+(1 row)
+select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+explain_filter
+- Plan: 
+    Node Type: "Seq Scan"
+    Parallel Aware: false
+    Async Capable: false
+    Relation Name: "int8_tbl"
+    Alias: "i8"
+    Startup Cost: N.N
+    Total Cost: N.N
+    Plan Rows: N
+    Plan Width: N
+    Actual Startup Time: N.N
+    Actual Total Time: N.N
+    Actual Rows: N.N
+    Actual Loops: N
+    Disabled: false
+    Shared Hit Blocks: N
+    Shared Read Blocks: N
+    Shared Dirtied Blocks: N
+    Shared Written Blocks: N
+    Local Hit Blocks: N
+    Local Read Blocks: N
+    Local Dirtied Blocks: N
+    Local Written Blocks: N
+    Temp Read Blocks: N
+    Temp Written Blocks: N
+  Planning: 
+    Shared Hit Blocks: N
+    Shared Read Blocks: N
+    Shared Dirtied Blocks: N
+    Shared Written Blocks: N
+    Local Hit Blocks: N
+    Local Read Blocks: N
+    Local Dirtied Blocks: N
+    Local Written Blocks: N
+    Temp Read Blocks: N
+    Temp Written Blocks: N
+  Planning Time: N.N
+  Triggers: 
+  Serialization: 
+    Time: N.N
+    Output Volume: N
+    Format: "text"
+    Shared Hit Blocks: N
+    Shared Read Blocks: N
+    Shared Dirtied Blocks: N
+    Shared Written Blocks: N
+    Local Hit Blocks: N
+    Local Read Blocks: N
+    Local Dirtied Blocks: N
+    Local Written Blocks: N
+    Temp Read Blocks: N
+    Temp Written Blocks: N
+  Execution Time: N.N
+(1 row)
 select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
-           explain_filter           
-------------------------------------
- [                                 +
-   {                               +
-     "Plan": {                     +
-       "Node Type": "Seq Scan",    +
-       "Parallel Aware": false,    +
-       "Async Capable": false,     +
-       "Relation Name": "int8_tbl",+
-       "Alias": "i8",              +
-       "Startup Cost": N.N,        +
-       "Total Cost": N.N,          +
-       "Plan Rows": N,             +
-       "Plan Width": N,            +
-       "Disabled": false,          +
-       "Shared Hit Blocks": N,     +
-       "Shared Read Blocks": N,    +
-       "Shared Dirtied Blocks": N, +
-       "Shared Written Blocks": N, +
-       "Local Hit Blocks": N,      +
-       "Local Read Blocks": N,     +
-       "Local Dirtied Blocks": N,  +
-       "Local Written Blocks": N,  +
-       "Temp Read Blocks": N,      +
-       "Temp Written Blocks": N    +
-     },                            +
-     "Planning": {                 +
-       "Shared Hit Blocks": N,     +
-       "Shared Read Blocks": N,    +
-       "Shared Dirtied Blocks": N, +
-       "Shared Written Blocks": N, +
-       "Local Hit Blocks": N,      +
-       "Local Read Blocks": N,     +
-       "Local Dirtied Blocks": N,  +
-       "Local Written Blocks": N,  +
-       "Temp Read Blocks": N,      +
-       "Temp Written Blocks": N    +
-     }                             +
-   }                               +
- ]
+explain_filter
+[
+  {
+    "Plan": {
+      "Node Type": "Seq Scan",
+      "Parallel Aware": false,
+      "Async Capable": false,
+      "Relation Name": "int8_tbl",
+      "Alias": "i8",
+      "Startup Cost": N.N,
+      "Total Cost": N.N,
+      "Plan Rows": N,
+      "Plan Width": N,
+      "Disabled": false,
+      "Shared Hit Blocks": N,
+      "Shared Read Blocks": N,
+      "Shared Dirtied Blocks": N,
+      "Shared Written Blocks": N,
+      "Local Hit Blocks": N,
+      "Local Read Blocks": N,
+      "Local Dirtied Blocks": N,
+      "Local Written Blocks": N,
+      "Temp Read Blocks": N,
+      "Temp Written Blocks": N
+    },
+    "Planning": {
+      "Shared Hit Blocks": N,
+      "Shared Read Blocks": N,
+      "Shared Dirtied Blocks": N,
+      "Shared Written Blocks": N,
+      "Local Hit Blocks": N,
+      "Local Read Blocks": N,
+      "Local Dirtied Blocks": N,
+      "Local Written Blocks": N,
+      "Temp Read Blocks": N,
+      "Temp Written Blocks": N
+    }
+  }
+]
 (1 row)
-
+\a
 -- Check expansion of window definitions
 select explain_filter('explain verbose select sum(unique1) over w, sum(unique2) over (w order by hundred), sum(tenthous) over (w order by hundred) from tenk1 window w as (partition by ten)');
                                             explain_filter                                             
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..8f10e1aff55 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -66,10 +66,13 @@ select explain_filter('explain select * from int8_tbl i8');
 select explain_filter('explain (analyze, buffers off) select * from int8_tbl i8');
 select explain_filter('explain (analyze, buffers off, verbose) select * from int8_tbl i8');
 select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
+select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+
+\a
 select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
 select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
-select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
 select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
+\a
 
 -- Check expansion of window definitions
 
-- 
2.53.0

From b584342347342bc0db8a96ea99094dbe8f48c7d3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Sun, 5 Apr 2026 17:17:22 -0400
Subject: [PATCH v12 4/7] Add EXPLAIN (IO) infrastructure and BitmapHeapScan IO
 instrumentation

---
 doc/src/sgml/ref/explain.sgml             |  12 +++
 src/backend/access/heap/heapam.c          |  10 ++
 src/backend/commands/explain.c            | 112 ++++++++++++++++++++++
 src/backend/commands/explain_state.c      |   8 ++
 src/backend/executor/nodeBitmapHeapscan.c |  19 +++-
 src/backend/storage/aio/read_stream.c     |  87 +++++++++++++++++
 src/include/access/relscan.h              |   6 ++
 src/include/access/tableam.h              |   3 +
 src/include/commands/explain_state.h      |   1 +
 src/include/executor/instrument.h         |   1 +
 src/include/executor/instrument_node.h    |  50 ++++++++++
 src/include/storage/read_stream.h         |   2 +
 src/tools/pgindent/typedefs.list          |   2 +
 13 files changed, 311 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..a854c41e963 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
     TIMING [ <replaceable class="parameter">boolean</replaceable> ]
     SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
     MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
+    IO [ <replaceable class="parameter">boolean</replaceable> ]
     FORMAT { TEXT | XML | JSON | YAML }
 </synopsis>
  </refsynopsisdiv>
@@ -298,6 +299,17 @@ ROLLBACK;
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>IO</literal></term>
+    <listitem>
+     <para>
+      Include information on I/O performed by each node.
+      This parameter may only be used when <literal>ANALYZE</literal> is also
+      enabled.  It defaults to <literal>FALSE</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>FORMAT</literal></term>
     <listitem>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f6ac5a0897c..89ab9742aa5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -43,6 +43,7 @@
 #include "catalog/pg_database.h"
 #include "catalog/pg_database_d.h"
 #include "commands/vacuum.h"
+#include "executor/instrument_node.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -1200,6 +1201,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_nkeys = nkeys;
 	scan->rs_base.rs_flags = flags;
 	scan->rs_base.rs_parallel = parallel_scan;
+	scan->rs_base.rs_instrument = NULL;
 	scan->rs_strategy = NULL;	/* set in initscan */
 	scan->rs_cbuf = InvalidBuffer;
 
@@ -1312,6 +1314,14 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 														  sizeof(TBMIterateResult));
 	}
 
+	/* enable read stream instrumentation */
+	if (flags & SO_SCAN_INSTRUMENT)
+	{
+		scan->rs_base.rs_instrument = palloc0_object(TableScanInstrumentation);
+		read_stream_enable_stats(scan->rs_read_stream,
+								 &scan->rs_base.rs_instrument->io);
+	}
+
 	scan->rs_vmbuffer = InvalidBuffer;
 
 	return (TableScanDesc) scan;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f151f21f9b3..863a9dd0f0d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/relscan.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "commands/createas.h"
@@ -139,6 +140,8 @@ static void show_hashagg_info(AggState *aggstate, ExplainState *es);
 static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
+static void show_scan_io_usage(ScanState *planstate,
+							   ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
 									   PlanState *planstate, ExplainState *es);
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
@@ -519,6 +522,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		instrument_option |= INSTRUMENT_BUFFERS;
 	if (es->wal)
 		instrument_option |= INSTRUMENT_WAL;
+	if (es->io)
+		instrument_option |= INSTRUMENT_IO;
 
 	/*
 	 * We always collect timing for the entire statement, even when node-level
@@ -2008,6 +2013,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
+			show_scan_io_usage((ScanState *) planstate, es);
 			break;
 		case T_SampleScan:
 			show_tablesample(((SampleScan *) plan)->tablesample,
@@ -3984,6 +3990,112 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 	}
 }
 
+static void
+print_io_usage(ExplainState *es, IOStats *stats)
+{
+	/* don't print stats if there's nothing to report */
+	if (stats->prefetch_count > 0)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			/* prefetch distance info */
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Prefetch: avg=%.2f max=%d capacity=%d\n",
+							 (stats->distance_sum * 1.0 / stats->prefetch_count),
+							 stats->distance_max,
+							 stats->distance_capacity);
+
+			/* prefetch I/O info (only if there were actual I/Os) */
+			if (stats->io_count > 0)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "I/O: count=%" PRIu64 " waits=%" PRIu64
+								 " size=%.2f inprogress=%.2f\n",
+								 stats->io_count, stats->wait_count,
+								 (stats->io_nblocks * 1.0 / stats->io_count),
+								 (stats->io_in_progress * 1.0 / stats->io_count));
+			}
+		}
+		else
+		{
+			ExplainPropertyFloat("Average Prefetch Distance", NULL,
+								 (stats->distance_sum * 1.0 / stats->prefetch_count), 3, es);
+			ExplainPropertyInteger("Max Prefetch Distance", NULL,
+								   stats->distance_max, es);
+			ExplainPropertyInteger("Prefetch Capacity", NULL,
+								   stats->distance_capacity, es);
+
+			ExplainPropertyUInteger("I/O Count", NULL,
+									stats->io_count, es);
+			ExplainPropertyUInteger("I/O Waits", NULL,
+									stats->wait_count, es);
+			ExplainPropertyFloat("Average I/O Size", NULL,
+								 (stats->io_nblocks * 1.0 / Max(1, stats->io_count)), 3, es);
+			ExplainPropertyFloat("Average I/Os In Progress", NULL,
+								 (stats->io_in_progress * 1.0 / Max(1, stats->io_count)), 3, es);
+		}
+	}
+}
+
+static void
+show_scan_io_usage(ScanState *planstate, ExplainState *es)
+{
+	Plan	   *plan = planstate->ps.plan;
+	IOStats		stats = {0};
+
+	if (!es->io)
+		return;
+
+	/*
+	 * Initialize counters with stats from the local process first.
+	 *
+	 * The scan descriptor may not exist, e.g. if the scan did not start, or
+	 * because of debug_parallel_query=regress. We still want to collect data
+	 * from workers.
+	 */
+	if (planstate->ss_currentScanDesc &&
+		planstate->ss_currentScanDesc->rs_instrument)
+	{
+		stats = planstate->ss_currentScanDesc->rs_instrument->io;
+	}
+
+	/*
+	 * Accumulate data from parallel workers (if any).
+	 */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapHeapScan:
+			{
+				SharedBitmapHeapInstrumentation *sinstrument
+				= ((BitmapHeapScanState *) planstate)->sinstrument;
+
+				if (sinstrument)
+				{
+					for (int i = 0; i < sinstrument->num_workers; ++i)
+					{
+						BitmapHeapScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
+
+						AccumulateIOStats(&stats, &winstrument->stats.io);
+
+						if (!es->workers_state)
+							continue;
+
+						ExplainOpenWorker(i, es);
+						print_io_usage(es, &winstrument->stats.io);
+						ExplainCloseWorker(i, es);
+					}
+				}
+
+				break;
+			}
+		default:
+			/* ignore other plans */
+			return;
+	}
+
+	print_io_usage(es, &stats);
+}
+
 /*
  * If it's EXPLAIN ANALYZE, show instrumentation information for a plan node
  *
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 65dd4111459..0e07a63fca6 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -162,6 +162,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 								"EXPLAIN", opt->defname, p),
 						 parser_errposition(pstate, opt->location)));
 		}
+		else if (strcmp(opt->defname, "io") == 0)
+			es->io = defGetBoolean(opt);
 		else if (!ApplyExtensionExplainOption(es, opt, pstate))
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -188,6 +190,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("EXPLAIN option %s requires ANALYZE", "TIMING")));
 
+	/* check that IO is used with EXPLAIN ANALYZE */
+	if (es->io && !es->analyze)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("EXPLAIN option %s requires ANALYZE", "IO")));
+
 	/* check that serialize is used with EXPLAIN ANALYZE */
 	if (es->serialize != EXPLAIN_SERIALIZE_NONE && !es->analyze)
 		ereport(ERROR,
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index d65e2a87b42..83d6478bc2b 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -144,13 +144,20 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
 	 */
 	if (!node->ss.ss_currentScanDesc)
 	{
+		uint32		flags = SO_NONE;
+
+		if (ScanRelIsReadOnly(&node->ss))
+			flags |= SO_HINT_REL_READ_ONLY;
+
+		if (node->ss.ps.state->es_instrument & INSTRUMENT_IO)
+			flags |= SO_SCAN_INSTRUMENT;
+
 		node->ss.ss_currentScanDesc =
 			table_beginscan_bm(node->ss.ss_currentRelation,
 							   node->ss.ps.state->es_snapshot,
 							   0,
 							   NULL,
-							   ScanRelIsReadOnly(&node->ss) ?
-							   SO_HINT_REL_READ_ONLY : SO_NONE);
+							   flags);
 	}
 
 	node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator;
@@ -330,6 +337,14 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 		 */
 		si->exact_pages += node->stats.exact_pages;
 		si->lossy_pages += node->stats.lossy_pages;
+
+		/* collect I/O instrumentation for this process */
+		if (node->ss.ss_currentScanDesc &&
+			node->ss.ss_currentScanDesc->rs_instrument)
+		{
+			AccumulateIOStats(&si->stats.io,
+							  &node->ss.ss_currentScanDesc->rs_instrument->io);
+		}
 	}
 
 	/*
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 0b6cdf7c873..936e08ea450 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -74,6 +74,7 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "executor/instrument_node.h"
 #include "storage/aio.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -123,6 +124,9 @@ struct ReadStream
 	bool		advice_enabled;
 	bool		temporary;
 
+	/* scan stats counters */
+	IOStats    *stats;
+
 	/*
 	 * One-block buffer to support 'ungetting' a block number, to resolve flow
 	 * control problems when I/Os are split.
@@ -188,6 +192,73 @@ block_range_read_stream_cb(ReadStream *stream,
 	return InvalidBlockNumber;
 }
 
+/*
+ * Update stream stats with current pinned buffer depth.
+ *
+ * Called once per buffer returned to the consumer in read_stream_next_buffer().
+ * Records the number of pinned buffers at that moment, so we can compute the
+ * average look-ahead depth.
+ */
+static inline void
+read_stream_count_prefetch(ReadStream *stream)
+{
+	IOStats    *stats = stream->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->prefetch_count++;
+	stats->distance_sum += stream->pinned_buffers;
+	if (stream->pinned_buffers > stats->distance_max)
+		stats->distance_max = stream->pinned_buffers;
+}
+
+/*
+ * Update stream stats about size of I/O requests.
+ *
+ * We count the number of I/O requests, size of requests (counted in blocks)
+ * and number of in-progress I/Os.
+ */
+static inline void
+read_stream_count_io(ReadStream *stream, int nblocks, int in_progress)
+{
+	IOStats    *stats = stream->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->io_count++;
+	stats->io_nblocks += nblocks;
+	stats->io_in_progress += in_progress;
+}
+
+/*
+ * Update stream stats about waits for I/O when consuming buffers.
+ *
+ * We count the number of I/O waits while pulling buffers out of a stream.
+ */
+static inline void
+read_stream_count_wait(ReadStream *stream)
+{
+	IOStats    *stats = stream->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->wait_count++;
+}
+
+/*
+ * Enable collection of stats into the provided IOStats.
+ */
+void
+read_stream_enable_stats(ReadStream *stream, IOStats *stats)
+{
+	stream->stats = stats;
+	if (stream->stats)
+		stream->stats->distance_capacity = stream->max_pinned_buffers;
+}
+
 /*
  * Ask the callback which block it would like us to read next, with a one block
  * buffer in front to allow read_stream_unget_block() to work.
@@ -426,6 +497,9 @@ read_stream_start_pending_read(ReadStream *stream)
 		Assert(stream->ios_in_progress < stream->max_ios);
 		stream->ios_in_progress++;
 		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+
+		/* update I/O stats */
+		read_stream_count_io(stream, nblocks, stream->ios_in_progress);
 	}
 
 	/*
@@ -1021,6 +1095,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 										flags)))
 			{
 				/* Fast return. */
+				read_stream_count_prefetch(stream);
 				return buffer;
 			}
 
@@ -1036,6 +1111,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			 * to avoid having to effectively do another synchronous IO for
 			 * the next block (if it were also a miss).
 			 */
+
+			/* update I/O stats */
+			read_stream_count_io(stream, 1, stream->ios_in_progress);
+
+			/* update prefetch distance */
+			read_stream_count_prefetch(stream);
 		}
 		else
 		{
@@ -1100,6 +1181,10 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 
 		needed_wait = WaitReadBuffers(&stream->ios[io_index].op);
 
+		/* Count it as a wait if we need to wait for IO */
+		if (needed_wait)
+			read_stream_count_wait(stream);
+
 		Assert(stream->ios_in_progress > 0);
 		stream->ios_in_progress--;
 		if (++stream->oldest_io_index == stream->max_ios)
@@ -1228,6 +1313,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	}
 #endif
 
+	read_stream_count_prefetch(stream);
+
 	/* Pin transferred to caller. */
 	Assert(stream->pinned_buffers > 0);
 	stream->pinned_buffers--;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index fd2076c582a..2ea06a67a63 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -24,6 +24,7 @@
 
 
 struct ParallelTableScanDescData;
+struct TableScanInstrumentation;
 
 /*
  * Generic descriptor for table scans. This is the base-class for table scans,
@@ -64,6 +65,11 @@ typedef struct TableScanDescData
 
 	struct ParallelTableScanDescData *rs_parallel;	/* parallel scan
 													 * information */
+
+	/*
+	 * Instrumentation counters maintained by all table AMs.
+	 */
+	struct TableScanInstrumentation *rs_instrument;
 } TableScanDescData;
 typedef struct TableScanDescData *TableScanDesc;
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a21c7db5439..c13f05d39db 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -69,6 +69,9 @@ typedef enum ScanOptions
 
 	/* set if the query doesn't modify the relation */
 	SO_HINT_REL_READ_ONLY = 1 << 10,
+
+	/* collect scan instrumentation */
+	SO_SCAN_INSTRUMENT = 1 << 11,
 }			ScanOptions;
 
 /*
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 6252fe11f15..97bc7ed49f6 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -55,6 +55,7 @@ typedef struct ExplainState
 	bool		summary;		/* print total planning and execution timing */
 	bool		memory;			/* print planner's memory usage information */
 	bool		settings;		/* print modified settings */
+	bool		io;				/* print info about IO (prefetch, ...) */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
 	ExplainFormat format;		/* output format */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index cc9fbb0e2f0..f093a52aae0 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
 	INSTRUMENT_BUFFERS = 1 << 1,	/* needs buffer usage */
 	INSTRUMENT_ROWS = 1 << 2,	/* needs row count */
 	INSTRUMENT_WAL = 1 << 3,	/* needs WAL usage */
+	INSTRUMENT_IO = 1 << 4,		/* needs IO usage */
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index e6a3f9f1941..22a75ccd863 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -48,6 +48,55 @@ typedef struct SharedAggInfo
 } SharedAggInfo;
 
 
+/* ---------------------
+ *	Instrumentation information about read streams and I/O
+ * ---------------------
+ */
+typedef struct IOStats
+{
+	/* number of buffers returned to consumer (for averaging distance) */
+	uint64		prefetch_count;
+
+	/* sum of pinned_buffers sampled at each buffer return */
+	uint64		distance_sum;
+
+	/* maximum actual pinned_buffers observed during the scan */
+	int16		distance_max;
+
+	/* maximum possible look-ahead distance (max_pinned_buffers) */
+	int16		distance_capacity;
+
+	/* number of times the consumer had to wait for an I/O to complete */
+	uint64		wait_count;
+
+	/* I/O stats */
+	uint64		io_count;		/* number of I/Os */
+	uint64		io_nblocks;		/* sum of blocks for all I/Os */
+	uint64		io_in_progress; /* sum of in-progress I/Os */
+} IOStats;
+
+typedef struct TableScanInstrumentation
+{
+	IOStats		io;
+} TableScanInstrumentation;
+
+/* merge IO statistics from 'src' into 'dst' */
+static inline void
+AccumulateIOStats(IOStats *dst, IOStats *src)
+{
+	dst->prefetch_count += src->prefetch_count;
+	dst->distance_sum += src->distance_sum;
+	if (src->distance_max > dst->distance_max)
+		dst->distance_max = src->distance_max;
+	if (src->distance_capacity > dst->distance_capacity)
+		dst->distance_capacity = src->distance_capacity;
+	dst->wait_count += src->wait_count;
+	dst->io_count += src->io_count;
+	dst->io_nblocks += src->io_nblocks;
+	dst->io_in_progress += src->io_in_progress;
+}
+
+
 /* ---------------------
  *	Instrumentation information for indexscans (amgettuple and amgetbitmap)
  * ---------------------
@@ -79,6 +128,7 @@ typedef struct BitmapHeapScanInstrumentation
 {
 	uint64		exact_pages;
 	uint64		lossy_pages;
+	TableScanInstrumentation stats;
 } BitmapHeapScanInstrumentation;
 
 /*
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index c9359b29b0f..aebb1fafb31 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -65,6 +65,7 @@
 
 struct ReadStream;
 typedef struct ReadStream ReadStream;
+typedef struct IOStats IOStats;
 
 /* for block_range_read_stream_cb */
 typedef struct BlockRangeReadStreamPrivate
@@ -103,5 +104,6 @@ extern BlockNumber read_stream_pause(ReadStream *stream);
 extern void read_stream_resume(ReadStream *stream);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
+extern void read_stream_enable_stats(ReadStream *stream, IOStats *stats);
 
 #endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6a39f5608..98b8d78e693 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1279,6 +1279,7 @@ IOContext
 IOFuncSelector
 IOObject
 IOOp
+IOStats
 IO_STATUS_BLOCK
 IPCompareMethod
 ITEM
@@ -3127,6 +3128,7 @@ TableLikeClause
 TableSampleClause
 TableScanDesc
 TableScanDescData
+TableScanInstrumentation
 TableSpaceCacheEntry
 TableSpaceOpts
 TableToProcess
-- 
2.53.0

From 85990c9c99327e76a5ede213bb705d33d97a1047 Mon Sep 17 00:00:00 2001
From: test <test>
Date: Tue, 7 Apr 2026 02:04:06 +0200
Subject: [PATCH v12 5/7] auto_explain

---
 contrib/auto_explain/auto_explain.c | 15 +++++++++++++++
 doc/src/sgml/auto-explain.sgml      | 20 ++++++++++++++++++++
 2 files changed, 35 insertions(+)
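A minimal sketch of the gating this patch adds, for anyone checking how `auto_explain.log_io` interacts with `auto_explain.log_analyze`. The `Model*` names are hypothetical and the timing option is left out for brevity. The bit values for the BUFFERS, ROWS, WAL and IO flags are the ones shown in instrument.h earlier in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Bit values as in instrument.h; the MODEL_ prefix marks them as stand-ins. */
enum
{
	MODEL_INSTRUMENT_BUFFERS = 1 << 1,
	MODEL_INSTRUMENT_ROWS = 1 << 2,
	MODEL_INSTRUMENT_WAL = 1 << 3,
	MODEL_INSTRUMENT_IO = 1 << 4
};

/* Hypothetical bundle of the auto_explain settings involved. */
typedef struct ModelSettings
{
	bool		log_analyze;
	bool		log_buffers;
	bool		log_io;
	bool		log_wal;
} ModelSettings;

/*
 * Executor start: each setting contributes its own bit, and none of them
 * has any effect unless log_analyze is on.
 */
static int
model_instrument_options(const ModelSettings *s)
{
	int			options = 0;

	if (!s->log_analyze)
		return 0;

	options |= MODEL_INSTRUMENT_ROWS;
	if (s->log_buffers)
		options |= MODEL_INSTRUMENT_BUFFERS;
	if (s->log_io)
		options |= MODEL_INSTRUMENT_IO;
	if (s->log_wal)
		options |= MODEL_INSTRUMENT_WAL;
	return options;
}

/* Executor end: I/O details are printed only for an analyzed plan. */
static bool
model_report_io(const ModelSettings *s, int options)
{
	bool		analyze = (options != 0 && s->log_analyze);

	return analyze && s->log_io;
}

#ifdef MODEL_AUTO_EXPLAIN_MAIN
int
main(void)
{
	ModelSettings s = {.log_analyze = true, .log_io = true, .log_wal = true};
	int			options = model_instrument_options(&s);

	assert(options & MODEL_INSTRUMENT_IO);
	printf("options=0x%x report_io=%d\n", options, model_report_io(&s, options));
	return 0;
}
#endif
```

The flags are independent bits, so enabling WAL logging and I/O logging together should set both.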

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 6ceae1c69ce..2eb2369c354 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -38,6 +38,7 @@ static int	auto_explain_log_parameter_max_length = -1; /* bytes or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
 static bool auto_explain_log_buffers = false;
+static bool auto_explain_log_io = false;
 static bool auto_explain_log_wal = false;
 static bool auto_explain_log_triggers = false;
 static bool auto_explain_log_timing = true;
@@ -203,6 +204,17 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomBoolVariable("auto_explain.log_io",
+							 "Log I/O statistics.",
+							 NULL,
+							 &auto_explain_log_io,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomBoolVariable("auto_explain.log_wal",
 							 "Log WAL usage.",
 							 NULL,
@@ -343,6 +355,8 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 				queryDesc->instrument_options |= INSTRUMENT_ROWS;
 			if (auto_explain_log_buffers)
 				queryDesc->instrument_options |= INSTRUMENT_BUFFERS;
+			if (auto_explain_log_io)
+				queryDesc->instrument_options |= INSTRUMENT_IO;
 			if (auto_explain_log_wal)
 				queryDesc->instrument_options |= INSTRUMENT_WAL;
 		}
@@ -440,6 +454,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
 			es->verbose = auto_explain_log_verbose;
 			es->buffers = (es->analyze && auto_explain_log_buffers);
+			es->io = (es->analyze && auto_explain_log_io);
 			es->wal = (es->analyze && auto_explain_log_wal);
 			es->timing = (es->analyze && auto_explain_log_timing);
 			es->summary = es->analyze;
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index ee85a67eb2e..06a8fcc6c5b 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -128,6 +128,26 @@ LOAD 'auto_explain';
     </listitem>
    </varlistentry>
 
+   <varlistentry id="auto-explain-configuration-parameters-log-io">
+    <term>
+     <varname>auto_explain.log_io</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>auto_explain.log_io</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      <varname>auto_explain.log_io</varname> controls whether I/O usage
+      statistics are printed when an execution plan is logged; it's
+      equivalent to the <literal>IO</literal> option of <command>EXPLAIN</command>.
+      This parameter has no effect
+      unless <varname>auto_explain.log_analyze</varname> is enabled.
+      This parameter is off by default.
+      Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="auto-explain-configuration-parameters-log-wal">
     <term>
      <varname>auto_explain.log_wal</varname> (<type>boolean</type>)
-- 
2.53.0

From 03899df55da0e525668ad6ef0d37fcc18ecf5dab Mon Sep 17 00:00:00 2001
From: test <test>
Date: Tue, 7 Apr 2026 01:38:53 +0200
Subject: [PATCH v12 6/7] Add EXPLAIN (IO) instrumentation for SeqScan

---
 src/backend/commands/explain.c         |  25 ++++++
 src/backend/executor/execParallel.c    |  11 +++
 src/backend/executor/nodeSeqscan.c     | 116 +++++++++++++++++++++++--
 src/include/executor/instrument_node.h |  19 ++++
 src/include/executor/nodeSeqscan.h     |   9 ++
 src/include/nodes/execnodes.h          |   1 +
 src/test/regress/expected/explain.out  |  18 +++-
 src/test/regress/sql/explain.sql       |   4 +-
 src/tools/pgindent/typedefs.list       |   2 +
 9 files changed, 192 insertions(+), 13 deletions(-)
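The per-worker instrumentation in this patch lives in one shared chunk with a flexible array member, sized as `offsetof(...) + n * sizeof(...)`. Below is a rough standalone model of that layout, using `malloc` in place of the DSM allocator. The `Model*` names are hypothetical and the per-worker payload is reduced to a single counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-worker payload; the patch stores full scan stats here. */
typedef struct ModelWorkerStats
{
	uint64_t	io_count;
} ModelWorkerStats;

/* Header followed by one slot per worker (C99 flexible array member). */
typedef struct ModelSharedStats
{
	int			num_workers;
	ModelWorkerStats workers[];
} ModelSharedStats;

/* Size of the header plus nworkers slots. */
static size_t
model_shared_size(int nworkers)
{
	return offsetof(ModelSharedStats, workers) +
		sizeof(ModelWorkerStats) * (size_t) nworkers;
}

/* Allocate and zero the container; malloc stands in for shared memory. */
static ModelSharedStats *
model_shared_create(int nworkers)
{
	size_t		size = model_shared_size(nworkers);
	ModelSharedStats *shared = malloc(size);

	if (shared == NULL)
		return NULL;
	memset(shared, 0, size);
	shared->num_workers = nworkers;
	return shared;
}

/* Valid slots are 0 .. num_workers - 1; anything else yields NULL. */
static ModelWorkerStats *
model_worker_slot(ModelSharedStats *shared, int worker_number)
{
	if (worker_number < 0 || worker_number >= shared->num_workers)
		return NULL;
	return &shared->workers[worker_number];
}

/* Copy the container into private memory so it outlives the shared chunk. */
static ModelSharedStats *
model_shared_copy(const ModelSharedStats *shared)
{
	size_t		size = model_shared_size(shared->num_workers);
	ModelSharedStats *copy = malloc(size);

	if (copy != NULL)
		memcpy(copy, shared, size);
	return copy;
}

#ifdef MODEL_SHARED_STATS_MAIN
int
main(void)
{
	ModelSharedStats *shared = model_shared_create(2);
	ModelSharedStats *copy;

	assert(shared != NULL);
	model_worker_slot(shared, 1)->io_count = 5;
	copy = model_shared_copy(shared);
	assert(copy != NULL);
	printf("worker 1 io_count=%llu\n",
		   (unsigned long long) copy->workers[1].io_count);
	free(copy);
	free(shared);
	return 0;
}
#endif
```

The copy step mirrors what a leader has to do before the shared segment is torn down: the size is recomputed from `num_workers`, so header and slots travel together.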

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 863a9dd0f0d..ac1bbf19a80 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2032,6 +2032,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 										   planstate, es);
 			if (IsA(plan, CteScan))
 				show_ctescan_info(castNode(CteScanState, planstate), es);
+			show_scan_io_usage((ScanState *) planstate, es);
 			break;
 		case T_Gather:
 			{
@@ -4086,6 +4087,30 @@ show_scan_io_usage(ScanState *planstate, ExplainState *es)
 					}
 				}
 
+				break;
+			}
+		case T_SeqScan:
+			{
+				SharedSeqScanInstrumentation *sinstrument
+				= ((SeqScanState *) planstate)->sinstrument;
+
+				if (sinstrument)
+				{
+					for (int i = 0; i < sinstrument->num_workers; ++i)
+					{
+						SeqScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
+
+						AccumulateIOStats(&stats, &winstrument->stats.io);
+
+						if (!es->workers_state)
+							continue;
+
+						ExplainOpenWorker(i, es);
+						print_io_usage(es, &winstrument->stats.io);
+						ExplainCloseWorker(i, es);
+					}
+				}
+
 				break;
 			}
 		default:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 1a5ec0c305f..9690f0938ae 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -257,6 +257,9 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			if (planstate->plan->parallel_aware)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecSeqScanInstrumentEstimate((SeqScanState *) planstate,
+										  e->pcxt);
 			break;
 		case T_IndexScanState:
 			if (planstate->plan->parallel_aware)
@@ -500,6 +503,9 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			if (planstate->plan->parallel_aware)
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecSeqScanInstrumentInitDSM((SeqScanState *) planstate,
+										 d->pcxt);
 			break;
 		case T_IndexScanState:
 			if (planstate->plan->parallel_aware)
@@ -1148,6 +1154,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_BitmapHeapScanState:
 			ExecBitmapHeapRetrieveInstrumentation((BitmapHeapScanState *) planstate);
 			break;
+		case T_SeqScanState:
+			ExecSeqScanRetrieveInstrumentation((SeqScanState *) planstate);
+			break;
 		default:
 			break;
 	}
@@ -1388,6 +1397,8 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 		case T_SeqScanState:
 			if (planstate->plan->parallel_aware)
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, pwcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecSeqScanInstrumentInitWorker((SeqScanState *) planstate, pwcxt);
 			break;
 		case T_IndexScanState:
 			if (planstate->plan->parallel_aware)
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 04803b0e37d..b95ac2a8696 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -29,6 +29,7 @@
 
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "executor/execParallel.h"
 #include "executor/execScan.h"
 #include "executor/executor.h"
 #include "executor/nodeSeqscan.h"
@@ -65,15 +66,21 @@ SeqNext(SeqScanState *node)
 
 	if (scandesc == NULL)
 	{
+		uint32		flags = SO_NONE;
+
+		if (ScanRelIsReadOnly(&node->ss))
+			flags |= SO_HINT_REL_READ_ONLY;
+
+		if (estate->es_instrument & INSTRUMENT_IO)
+			flags |= SO_SCAN_INSTRUMENT;
+
 		/*
 		 * We reach here if the scan is not parallel, or if we're serially
 		 * executing a scan that was planned to be parallel.
 		 */
 		scandesc = table_beginscan(node->ss.ss_currentRelation,
 								   estate->es_snapshot,
-								   0, NULL,
-								   ScanRelIsReadOnly(&node->ss) ?
-								   SO_HINT_REL_READ_ONLY : SO_NONE);
+								   0, NULL, flags);
 		node->ss.ss_currentScanDesc = scandesc;
 	}
 
@@ -297,6 +304,24 @@ ExecEndSeqScan(SeqScanState *node)
 {
 	TableScanDesc scanDesc;
 
+	/*
+	 * Collect IO stats for this process into shared instrumentation.
+	 */
+	if (node->sinstrument != NULL && IsParallelWorker())
+	{
+		SeqScanInstrumentation *si;
+
+		Assert(ParallelWorkerNumber < node->sinstrument->num_workers);
+		si = &node->sinstrument->sinstrument[ParallelWorkerNumber];
+
+		if (node->ss.ss_currentScanDesc &&
+			node->ss.ss_currentScanDesc->rs_instrument)
+		{
+			AccumulateIOStats(&si->stats.io,
+							  &node->ss.ss_currentScanDesc->rs_instrument->io);
+		}
+	}
+
 	/*
 	 * get information from node
 	 */
@@ -370,6 +395,13 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 {
 	EState	   *estate = node->ss.ps.state;
 	ParallelTableScanDesc pscan;
+	uint32		flags = SO_NONE;
+
+	if (ScanRelIsReadOnly(&node->ss))
+		flags |= SO_HINT_REL_READ_ONLY;
+
+	if (estate->es_instrument & INSTRUMENT_IO)
+		flags |= SO_SCAN_INSTRUMENT;
 
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -378,9 +410,7 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 
 	node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
-								 ScanRelIsReadOnly(&node->ss) ?
-								 SO_HINT_REL_READ_ONLY : SO_NONE);
+		table_beginscan_parallel(node->ss.ss_currentRelation, pscan, flags);
 }
 
 /* ----------------------------------------------------------------
@@ -410,10 +440,78 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
 							ParallelWorkerContext *pwcxt)
 {
 	ParallelTableScanDesc pscan;
+	uint32		flags = SO_NONE;
+
+	if (ScanRelIsReadOnly(&node->ss))
+		flags |= SO_HINT_REL_READ_ONLY;
+
+	if (node->ss.ps.state->es_instrument & INSTRUMENT_IO)
+		flags |= SO_SCAN_INSTRUMENT;
 
 	pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->ss.ss_currentScanDesc =
-		table_beginscan_parallel(node->ss.ss_currentRelation, pscan,
-								 ScanRelIsReadOnly(&node->ss) ?
-								 SO_HINT_REL_READ_ONLY : SO_NONE);
+		table_beginscan_parallel(node->ss.ss_currentRelation, pscan, flags);
+}
+
+void
+ExecSeqScanInstrumentEstimate(SeqScanState *node, ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	if (!estate->es_instrument || pcxt->nworkers == 0)
+		return;
+
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   offsetof(SharedSeqScanInstrumentation, sinstrument) +
+						   sizeof(SeqScanInstrumentation) * pcxt->nworkers);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+void
+ExecSeqScanInstrumentInitDSM(SeqScanState *node, ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	SharedSeqScanInstrumentation *sinstrument;
+	Size		size;
+
+	if (!estate->es_instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedSeqScanInstrumentation, sinstrument) +
+		sizeof(SeqScanInstrumentation) * pcxt->nworkers;
+	sinstrument = shm_toc_allocate(pcxt->toc, size);
+	memset(sinstrument, 0, size);
+	sinstrument->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc,
+				   node->ss.ps.plan->plan_node_id + PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET,
+				   sinstrument);
+	node->sinstrument = sinstrument;
+}
+
+void
+ExecSeqScanInstrumentInitWorker(SeqScanState *node,
+								ParallelWorkerContext *pwcxt)
+{
+	if (!node->ss.ps.state->es_instrument)
+		return;
+
+	node->sinstrument = shm_toc_lookup(pwcxt->toc,
+									   node->ss.ps.plan->plan_node_id + PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET,
+									   true);
+}
+
+void
+ExecSeqScanRetrieveInstrumentation(SeqScanState *node)
+{
+	SharedSeqScanInstrumentation *sinstrument = node->sinstrument;
+	Size		size;
+
+	if (sinstrument == NULL)
+		return;
+
+	size = offsetof(SharedSeqScanInstrumentation, sinstrument)
+		+ sinstrument->num_workers * sizeof(SeqScanInstrumentation);
+
+	node->sinstrument = palloc(size);
+	memcpy(node->sinstrument, sinstrument, size);
 }
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 22a75ccd863..003dc262b5d 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -266,4 +266,23 @@ typedef struct SharedIncrementalSortInfo
 	IncrementalSortInfo sinfo[FLEXIBLE_ARRAY_MEMBER];
 } SharedIncrementalSortInfo;
 
+
+/* ---------------------
+ *	Instrumentation information for sequential scans
+ * ---------------------
+ */
+typedef struct SeqScanInstrumentation
+{
+	TableScanInstrumentation stats;
+} SeqScanInstrumentation;
+
+/*
+ * Shared memory container for per-worker information
+ */
+typedef struct SharedSeqScanInstrumentation
+{
+	int			num_workers;
+	SeqScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
+} SharedSeqScanInstrumentation;
+
 #endif							/* INSTRUMENT_NODE_H */
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index 7a1490596fb..9c0ad4879d7 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -28,4 +28,13 @@ extern void ExecSeqScanReInitializeDSM(SeqScanState *node, ParallelContext *pcxt
 extern void ExecSeqScanInitializeWorker(SeqScanState *node,
 										ParallelWorkerContext *pwcxt);
 
+/* instrument support */
+extern void ExecSeqScanInstrumentEstimate(SeqScanState *node,
+										  ParallelContext *pcxt);
+extern void ExecSeqScanInstrumentInitDSM(SeqScanState *node,
+										 ParallelContext *pcxt);
+extern void ExecSeqScanInstrumentInitWorker(SeqScanState *node,
+											ParallelWorkerContext *pwcxt);
+extern void ExecSeqScanRetrieveInstrumentation(SeqScanState *node);
+
 #endif							/* NODESEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..56febb3204c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1670,6 +1670,7 @@ typedef struct SeqScanState
 {
 	ScanState	ss;				/* its first field is NodeTag */
 	Size		pscan_len;		/* size of parallel heap scan descriptor */
+	struct SharedSeqScanInstrumentation *sinstrument;
 } SeqScanState;
 
 /* ----------------
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc31c7ce9f9..74a4d87801e 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -100,7 +100,7 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 (1 row)
 
 \a
-select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
+select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
 explain_filter
 <explain xmlns="http://www.postgresql.org/N/explain">
   <Query>
@@ -119,6 +119,13 @@ explain_filter
       <Actual-Rows>N.N</Actual-Rows>
       <Actual-Loops>N</Actual-Loops>
       <Disabled>false</Disabled>
+      <Average-Prefetch-Distance>N.N</Average-Prefetch-Distance>
+      <Max-Prefetch-Distance>N</Max-Prefetch-Distance>
+      <Prefetch-Capacity>N</Prefetch-Capacity>
+      <I-O-Count>N</I-O-Count>
+      <I-O-Waits>N</I-O-Waits>
+      <Average-I-O-Size>N.N</Average-I-O-Size>
+      <Average-I-Os-In-Progress>N.N</Average-I-Os-In-Progress>
       <Shared-Hit-Blocks>N</Shared-Hit-Blocks>
       <Shared-Read-Blocks>N</Shared-Read-Blocks>
       <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>
@@ -149,7 +156,7 @@ explain_filter
   </Query>
 </explain>
 (1 row)
-select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+select explain_filter('explain (analyze, serialize, buffers, io, format yaml) select * from int8_tbl i8');
 explain_filter
 - Plan: 
     Node Type: "Seq Scan"
@@ -166,6 +173,13 @@ explain_filter
     Actual Rows: N.N
     Actual Loops: N
     Disabled: false
+    Average Prefetch Distance: N.N
+    Max Prefetch Distance: N
+    Prefetch Capacity: N
+    I/O Count: N
+    I/O Waits: N
+    Average I/O Size: N.N
+    Average I/Os In Progress: N.N
     Shared Hit Blocks: N
     Shared Read Blocks: N
     Shared Dirtied Blocks: N
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 8f10e1aff55..2f163c64bf6 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -69,8 +69,8 @@ select explain_filter('explain (analyze, buffers, format text) select * from int
 select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
 
 \a
-select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
-select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
+select explain_filter('explain (analyze, serialize, buffers, io, format yaml) select * from int8_tbl i8');
 select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
 \a
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 98b8d78e693..c0436a13ac3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2800,6 +2800,7 @@ SelfJoinCandidate
 SemTPadded
 SemiAntiJoinFactors
 SeqScan
+SeqScanInstrumentation
 SeqScanState
 SeqTable
 SeqTableData
@@ -2864,6 +2865,7 @@ SharedMemoizeInfo
 SharedRecordTableEntry
 SharedRecordTableKey
 SharedRecordTypmodRegistry
+SharedSeqScanInstrumentation
 SharedSortInfo
 SharedTuplestore
 SharedTuplestoreAccessor
-- 
2.53.0

From bb6eb339595f5e21e05ff12fe7260aace3b7e564 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Sun, 5 Apr 2026 17:31:18 -0400
Subject: [PATCH v12 7/7] Add EXPLAIN (IO) instrumentation for TidRangeScan

---
 src/backend/commands/explain.c          |  25 ++++++
 src/backend/executor/execParallel.c     |  12 +++
 src/backend/executor/nodeTidrangescan.c | 113 ++++++++++++++++++++++--
 src/include/executor/instrument_node.h  |  18 ++++
 src/include/executor/nodeTidrangescan.h |   9 ++
 src/include/nodes/execnodes.h           |   1 +
 src/tools/pgindent/typedefs.list        |   2 +
 7 files changed, 172 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ac1bbf19a80..257ce2bfc54 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2149,6 +2149,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				if (plan->qual)
 					show_instrumentation_count("Rows Removed by Filter", 1,
 											   planstate, es);
+				show_scan_io_usage((ScanState *) planstate, es);
 			}
 			break;
 		case T_ForeignScan:
@@ -4111,6 +4112,30 @@ show_scan_io_usage(ScanState *planstate, ExplainState *es)
 					}
 				}
 
+				break;
+			}
+		case T_TidRangeScan:
+			{
+				SharedTidRangeScanInstrumentation *sinstrument
+				= ((TidRangeScanState *) planstate)->trss_sinstrument;
+
+				if (sinstrument)
+				{
+					for (int i = 0; i < sinstrument->num_workers; ++i)
+					{
+						TidRangeScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
+
+						AccumulateIOStats(&stats, &winstrument->stats.io);
+
+						if (!es->workers_state)
+							continue;
+
+						ExplainOpenWorker(i, es);
+						print_io_usage(es, &winstrument->stats.io);
+						ExplainCloseWorker(i, es);
+					}
+				}
+
 				break;
 			}
 		default:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 9690f0938ae..81b87d82fab 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -291,6 +291,9 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 			if (planstate->plan->parallel_aware)
 				ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
 										 e->pcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecTidRangeScanInstrumentEstimate((TidRangeScanState *) planstate,
+											   e->pcxt);
 			break;
 		case T_AppendState:
 			if (planstate->plan->parallel_aware)
@@ -536,6 +539,9 @@ ExecParallelInitializeDSM(PlanState *planstate,
 			if (planstate->plan->parallel_aware)
 				ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
 											  d->pcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecTidRangeScanInstrumentInitDSM((TidRangeScanState *) planstate,
+											  d->pcxt);
 			break;
 		case T_AppendState:
 			if (planstate->plan->parallel_aware)
@@ -1157,6 +1163,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_SeqScanState:
 			ExecSeqScanRetrieveInstrumentation((SeqScanState *) planstate);
 			break;
+		case T_TidRangeScanState:
+			ExecTidRangeScanRetrieveInstrumentation((TidRangeScanState *) planstate);
+			break;
 		default:
 			break;
 	}
@@ -1430,6 +1439,9 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
 			if (planstate->plan->parallel_aware)
 				ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate,
 												 pwcxt);
+			/* even when not parallel-aware, for EXPLAIN ANALYZE */
+			ExecTidRangeScanInstrumentInitWorker((TidRangeScanState *) planstate,
+												 pwcxt);
 			break;
 		case T_AppendState:
 			if (planstate->plan->parallel_aware)
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 4a8fe91b2b3..012e88e88f6 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -18,8 +18,10 @@
 #include "access/sysattr.h"
 #include "access/tableam.h"
 #include "catalog/pg_operator.h"
+#include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/instrument.h"
 #include "executor/nodeTidrangescan.h"
 #include "nodes/nodeFuncs.h"
 #include "utils/rel.h"
 
@@ -242,12 +244,19 @@ TidRangeNext(TidRangeScanState *node)
 
 		if (scandesc == NULL)
 		{
+			uint32		flags = SO_NONE;
+
+			if (ScanRelIsReadOnly(&node->ss))
+				flags |= SO_HINT_REL_READ_ONLY;
+
+			if (estate->es_instrument & INSTRUMENT_IO)
+				flags |= SO_SCAN_INSTRUMENT;
+
 			scandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
 												estate->es_snapshot,
 												&node->trss_mintid,
 												&node->trss_maxtid,
-												ScanRelIsReadOnly(&node->ss) ?
-												SO_HINT_REL_READ_ONLY : SO_NONE);
+												flags);
 			node->ss.ss_currentScanDesc = scandesc;
 		}
 		else
@@ -342,6 +351,19 @@ ExecEndTidRangeScan(TidRangeScanState *node)
 {
 	TableScanDesc scan = node->ss.ss_currentScanDesc;
 
+	/* Collect IO stats for this process into shared instrumentation */
+	if (node->trss_sinstrument != NULL && IsParallelWorker())
+	{
+		TidRangeScanInstrumentation *si;
+
+		Assert(ParallelWorkerNumber < node->trss_sinstrument->num_workers);
+		si = &node->trss_sinstrument->sinstrument[ParallelWorkerNumber];
+
+		if (scan && scan->rs_instrument)
+			AccumulateIOStats(&si->stats.io,
+							  &scan->rs_instrument->io);
+	}
+
 	if (scan != NULL)
 		table_endscan(scan);
 }
@@ -454,6 +476,13 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
 {
 	EState	   *estate = node->ss.ps.state;
 	ParallelTableScanDesc pscan;
+	uint32		flags = SO_NONE;
+
+	if (ScanRelIsReadOnly(&node->ss))
+		flags |= SO_HINT_REL_READ_ONLY;
+
+	if (estate->es_instrument)
+		flags |= SO_SCAN_INSTRUMENT;
 
 	pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -462,9 +491,7 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
-										  pscan,
-										  ScanRelIsReadOnly(&node->ss) ?
-										  SO_HINT_REL_READ_ONLY : SO_NONE);
+										  pscan, flags);
 }
 
 /* ----------------------------------------------------------------
@@ -494,11 +521,81 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
 								 ParallelWorkerContext *pwcxt)
 {
 	ParallelTableScanDesc pscan;
+	uint32		flags = SO_NONE;
+
+	if (ScanRelIsReadOnly(&node->ss))
+		flags |= SO_HINT_REL_READ_ONLY;
+
+	if (node->ss.ps.state->es_instrument)
+		flags |= SO_SCAN_INSTRUMENT;
 
 	pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
-										  pscan,
-										  ScanRelIsReadOnly(&node->ss) ?
-										  SO_HINT_REL_READ_ONLY : SO_NONE);
+										  pscan, flags);
+}
+
+void
+ExecTidRangeScanInstrumentEstimate(TidRangeScanState *node,
+								   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	if (!estate->es_instrument || pcxt->nworkers == 0)
+		return;
+
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   offsetof(SharedTidRangeScanInstrumentation, sinstrument) +
+						   sizeof(TidRangeScanInstrumentation) * pcxt->nworkers);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+void
+ExecTidRangeScanInstrumentInitDSM(TidRangeScanState *node,
+								  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	SharedTidRangeScanInstrumentation *sinstrument;
+	Size		size;
+
+	if (!estate->es_instrument || pcxt->nworkers == 0)
+		return;
+
+	size = offsetof(SharedTidRangeScanInstrumentation, sinstrument) +
+		sizeof(TidRangeScanInstrumentation) * pcxt->nworkers;
+	sinstrument = shm_toc_allocate(pcxt->toc, size);
+	memset(sinstrument, 0, size);
+	sinstrument->num_workers = pcxt->nworkers;
+	shm_toc_insert(pcxt->toc,
+				   node->ss.ps.plan->plan_node_id + PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET,
+				   sinstrument);
+	node->trss_sinstrument = sinstrument;
+}
+
+void
+ExecTidRangeScanInstrumentInitWorker(TidRangeScanState *node,
+									 ParallelWorkerContext *pwcxt)
+{
+	if (!node->ss.ps.state->es_instrument)
+		return;
+
+	node->trss_sinstrument = shm_toc_lookup(pwcxt->toc,
+											node->ss.ps.plan->plan_node_id + PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET,
+											true);
+}
+
+void
+ExecTidRangeScanRetrieveInstrumentation(TidRangeScanState *node)
+{
+	SharedTidRangeScanInstrumentation *sinstrument = node->trss_sinstrument;
+	Size		size;
+
+	if (sinstrument == NULL)
+		return;
+
+	size = offsetof(SharedTidRangeScanInstrumentation, sinstrument)
+		+ sinstrument->num_workers * sizeof(TidRangeScanInstrumentation);
+
+	node->trss_sinstrument = palloc(size);
+	memcpy(node->trss_sinstrument, sinstrument, size);
 }
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 003dc262b5d..4076990408e 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -285,4 +285,22 @@ typedef struct SharedSeqScanInstrumentation
 	SeqScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
 } SharedSeqScanInstrumentation;
 
+
+/*
+ *	Instrumentation information for TID range scans
+ */
+typedef struct TidRangeScanInstrumentation
+{
+	TableScanInstrumentation stats;
+} TidRangeScanInstrumentation;
+
+/*
+ * Shared memory container for per-worker information
+ */
+typedef struct SharedTidRangeScanInstrumentation
+{
+	int			num_workers;
+	TidRangeScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
+} SharedTidRangeScanInstrumentation;
+
 #endif							/* INSTRUMENT_NODE_H */
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index 8752d1ea8c4..9e7d0a357bb 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -28,4 +28,13 @@ extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelConte
 extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
 extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
 
+/* instrument support */
+extern void ExecTidRangeScanInstrumentEstimate(TidRangeScanState *node,
+											   ParallelContext *pcxt);
+extern void ExecTidRangeScanInstrumentInitDSM(TidRangeScanState *node,
+											  ParallelContext *pcxt);
+extern void ExecTidRangeScanInstrumentInitWorker(TidRangeScanState *node,
+												 ParallelWorkerContext *pwcxt);
+extern void ExecTidRangeScanRetrieveInstrumentation(TidRangeScanState *node);
+
 #endif							/* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 56febb3204c..13359180d25 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1922,6 +1922,7 @@ typedef struct TidRangeScanState
 	ItemPointerData trss_maxtid;
 	bool		trss_inScan;
 	Size		trss_pscanlen;
+	struct SharedTidRangeScanInstrumentation *trss_sinstrument;
 } TidRangeScanState;
 
 /* ----------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c0436a13ac3..0cd41106975 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2867,6 +2867,7 @@ SharedRecordTableKey
 SharedRecordTypmodRegistry
 SharedSeqScanInstrumentation
 SharedSortInfo
+SharedTidRangeScanInstrumentation
 SharedTuplestore
 SharedTuplestoreAccessor
 SharedTuplestoreChunk
@@ -3171,6 +3172,7 @@ TidOpExpr
 TidPath
 TidRangePath
 TidRangeScan
+TidRangeScanInstrumentation
 TidRangeScanState
 TidScan
 TidScanState
-- 
2.53.0
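
For readers following the Shared*Instrumentation structs in the patches
above, here is a rough standalone sketch of the pattern they rely on: a
header plus a flexible array with one slot per worker, sized with
offsetof() + n * sizeof(), filled per worker, and summed by the leader.
Every name in it (DemoIOStats, DemoSharedInstrumentation, demo_*) is
made up for illustration, plain calloc() stands in for the DSM
allocation, and the two counters are placeholders rather than the real
I/O stats fields. The sketch checks the worker index with a strict
bound because slots are indexed 0 .. num_workers - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-scan I/O counters; not the PostgreSQL type. */
typedef struct DemoIOStats
{
    uint64_t    io_count;
    uint64_t    io_waits;
} DemoIOStats;

/* Header plus one slot per worker, mirroring the shape of the Shared* structs. */
typedef struct DemoSharedInstrumentation
{
    int         num_workers;
    DemoIOStats slots[];        /* C99 flexible array member */
} DemoSharedInstrumentation;

/* Bytes needed for the header and nworkers slots. */
static size_t
demo_shared_size(int nworkers)
{
    return offsetof(DemoSharedInstrumentation, slots) +
        sizeof(DemoIOStats) * (size_t) nworkers;
}

/* Allocate and zero the container; calloc stands in for shared memory here. */
static DemoSharedInstrumentation *
demo_shared_create(int nworkers)
{
    DemoSharedInstrumentation *shared = calloc(1, demo_shared_size(nworkers));

    if (shared != NULL)
        shared->num_workers = nworkers;
    return shared;
}

/* Add one worker's local counters into that worker's own slot. */
static void
demo_worker_report(DemoSharedInstrumentation *shared, int worker,
                   const DemoIOStats *local)
{
    assert(worker >= 0 && worker < shared->num_workers);
    shared->slots[worker].io_count += local->io_count;
    shared->slots[worker].io_waits += local->io_waits;
}

/* Sum all worker slots, as a leader would before printing totals. */
static DemoIOStats
demo_total(const DemoSharedInstrumentation *shared)
{
    DemoIOStats total = {0, 0};

    for (int i = 0; i < shared->num_workers; i++)
    {
        total.io_count += shared->slots[i].io_count;
        total.io_waits += shared->slots[i].io_waits;
    }
    return total;
}
```

Each worker writes only to its own slot, so the sketch needs no locking;
the leader reads the slots only after the workers have finished.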
