Okay, so I realized v35 had an issue where I wasn't counting strategy
evictions correctly. Fixed in attached v36. This made me wonder whether
there is actually a way to add a test for evictions (in strategy and shared
contexts) that is not flaky.

On Sun, Oct 23, 2022 at 6:48 PM Maciek Sakrejda <m.sakre...@gmail.com> wrote:
>
> On Thu, Oct 20, 2022 at 10:31 AM Andres Freund <and...@anarazel.de> wrote:
> > - "repossession" is a very unintuitive name for me. If we want something 
> > like
> >   it, can't we just name it reuse_failed or such?
>
> +1, I think "repossessed" is awkward. I think "reuse_failed" works,
> but no strong opinions on an alternate name.

Also, re: repossessed: I can change it to reuse_failed, but I do think it
is important to give users a way to distinguish between bulkread rejections
of dirty buffers and strategies failing to reuse buffers due to concurrent
pinning (since the reaction to these two scenarios would likely be
different).

If we added another column called something like "claim_failed", counting
buffers we failed to reuse because of concurrent pinning or usage, we could
recommend using it together with "reuse_failed" to determine the cause of
the failed reuses for a bulkread. We could also use "claim_failed" in the
shared IOContext to provide information on shared buffer contention.
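
As a sketch of how this could be used (note that "reuse_failed" and
"claim_failed" are only proposed names at this point, and the row-identifying
columns below are assumptions about the pg_stat_io view), a diagnostic query
might look like

    SELECT backend_type, io_context, reuse_failed, claim_failed
      FROM pg_stat_io
     WHERE io_context = 'bulkread';

and comparing the two counts would show whether the failed reuses for a
bulkread were mostly dirty-buffer rejections or mostly buffers we could not
claim due to concurrent pins.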

- Melanie
From 0d5fc7da60f6b02259b8dd1d2eab25967cb9a95a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v36 3/5] Aggregate IO operation stats per BackendType

Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by the cumulative stats system
for management of these statistics.
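
As a usage sketch (the 'io' reset target is added in pgstatfuncs.c below),
the accumulated IO operation stats can be reset from SQL with

    SELECT pg_stat_reset_shared('io');

which zeroes the counters for all BackendTypes and IOContexts.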

The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per-connection IO statistics and
monitoring.

Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the course
of normal operations in order to flush their backend-local IO operation
statistics to shared memory in a timely manner.

Because not all combinations of BackendType, IOOp, and IOContext are valid,
the validity of the stats is checked before flushing pending stats and
before reading the existing stats file into shared memory.

Author: Melanie Plageman <melanieplage...@gmail.com>
Reviewed-by: Andres Freund <and...@anarazel.de>
Reviewed-by: Justin Pryzby <pry...@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota....@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakre...@gmail.com>
Reviewed-by: Lukas Fittl <lu...@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml                  |   2 +
 src/backend/utils/activity/pgstat.c           |  35 ++++
 src/backend/utils/activity/pgstat_bgwriter.c  |   7 +-
 .../utils/activity/pgstat_checkpointer.c      |   7 +-
 src/backend/utils/activity/pgstat_io_ops.c    | 164 ++++++++++++++++++
 src/backend/utils/activity/pgstat_relation.c  |  15 +-
 src/backend/utils/activity/pgstat_shmem.c     |   4 +
 src/backend/utils/activity/pgstat_wal.c       |   4 +-
 src/backend/utils/adt/pgstatfuncs.c           |   4 +-
 src/include/miscadmin.h                       |   2 +
 src/include/pgstat.h                          |  88 ++++++++++
 src/include/utils/pgstat_internal.h           |  36 ++++
 src/tools/pgindent/typedefs.list              |   3 +
 13 files changed, 365 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..698f274341 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5390,6 +5390,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
         the <structname>pg_stat_archiver</structname> view,
+        <literal>io</literal> to reset all the counters shown in the
+        <structname>pg_stat_io</structname> view,
         <literal>wal</literal> to reset all the counters shown in the
         <structname>pg_stat_wal</structname> view or
         <literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1b97597f17..4becee9a6c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.snapshot_cb = pgstat_checkpointer_snapshot_cb,
 	},
 
+	[PGSTAT_KIND_IOOPS] = {
+		.name = "io_ops",
+
+		.fixed_amount = true,
+
+		.reset_all_cb = pgstat_io_ops_reset_all_cb,
+		.snapshot_cb = pgstat_io_ops_snapshot_cb,
+	},
+
 	[PGSTAT_KIND_SLRU] = {
 		.name = "slru",
 
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
 
 	/* Don't expend a clock check if nothing to do */
 	if (dlist_is_empty(&pgStatPending) &&
+		!have_ioopstats &&
 		!have_slrustats &&
 		!pgstat_have_pending_wal())
 	{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
 	/* flush database / relation / function / ... stats */
 	partial_flush |= pgstat_flush_pending_entries(nowait);
 
+	/* flush IO Operations stats */
+	partial_flush |= pgstat_flush_io_ops(nowait);
+
 	/* flush wal stats */
 	partial_flush |= pgstat_flush_wal(nowait);
 
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
 	pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
 	write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
 
+	/*
+	 * Write IO Operations stats struct
+	 */
+	pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+	write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+		write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
 	/*
 	 * Write SLRU stats struct
 	 */
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
 	if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
 		goto error;
 
+	/*
+	 * Read IO Operations stats struct
+	 */
+	if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+		goto error;
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+												   (BackendType) bktype);
+		if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+			goto error;
+	}
+
 	/*
 	 * Read SLRU stats struct
 	 */
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
 
 
 /*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
  */
 void
 pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+	/*
+	 * Report IO Operations statistics
+	 */
+	pgstat_flush_io_ops(false);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
 
 
 /*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
  */
 void
 pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+	/*
+	 * Report IO Operation statistics
+	 */
+	pgstat_flush_io_ops(false);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 6f9c250907..9f0f27da1f 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,48 @@
 #include "utils/pgstat_internal.h"
 
 static PgStat_IOContextOps pending_IOOpStats;
+bool		have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			target->evictions += source->evictions;
+			return;
+		case IOOP_EXTEND:
+			target->extends += source->extends;
+			return;
+		case IOOP_FSYNC:
+			target->fsyncs += source->fsyncs;
+			return;
+		case IOOP_READ:
+			target->reads += source->reads;
+			return;
+		case IOOP_REPOSSESS:
+			target->repossessions += source->repossessions;
+			return;
+		case IOOP_REJECT:
+			target->rejections += source->rejections;
+			return;
+		case IOOP_REUSE:
+			target->reuses += source->reuses;
+			return;
+		case IOOP_WRITE:
+			target->writes += source->writes;
+			return;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
 
 void
 pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -60,6 +102,78 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
 			break;
 	}
 
+	have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+	pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+	return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing. Otherwise, it returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+	PgStatShared_IOContextOps *type_shstats;
+	bool		expect_backend_stats = true;
+
+	if (!have_ioopstats)
+		return false;
+
+	type_shstats =
+		&pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+	if (!nowait)
+		LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+		return true;
+
+	expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+	for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+	{
+		PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+		PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+		if (!expect_backend_stats ||
+			!pgstat_bktype_io_context_valid(MyBackendType, (IOContext) io_context))
+		{
+			pgstat_io_context_ops_assert_zero(sharedent);
+			pgstat_io_context_ops_assert_zero(pendingent);
+			continue;
+		}
+
+		for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+		{
+			if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+									 (IOOp) io_op)))
+			{
+				pgstat_io_op_assert_zero(sharedent, (IOOp) io_op);
+				pgstat_io_op_assert_zero(pendingent, (IOOp) io_op);
+				continue;
+			}
+
+			pgstat_accum_io_op(sharedent, pendingent, (IOOp) io_op);
+		}
+	}
+
+	LWLockRelease(&type_shstats->lock);
+
+	memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+	have_ioopstats = false;
+
+	return false;
 }
 
 const char *
@@ -108,6 +222,56 @@ pgstat_io_op_desc(IOOp io_op)
 	elog(ERROR, "unrecognized IOOp value: %d", io_op);
 }
 
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+	PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+		LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+		/*
+		 * Use the lock in the first BackendType's PgStatShared_IOContextOps
+		 * to protect the reset timestamp as well.
+		 */
+		if (i == 0)
+			backends_stats_shmem->stat_reset_timestamp = ts;
+
+		memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+		LWLockRelease(&stats_shmem->lock);
+	}
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+	PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+	PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+		PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+		LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+		/*
+		 * Use the lock in the first BackendType's PgStatShared_IOContextOps
+		 * to protect the reset timestamp as well.
+		 */
+		if (i == 0)
+			backends_stats_snap->stat_reset_timestamp =
+				backends_stats_shmem->stat_reset_timestamp;
+
+		memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+		LWLockRelease(&stats_shmem->lock);
+	}
+
+}
+
 /*
 * IO Operation statistics are not collected for all BackendTypes.
 *
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f583..a23a90b133 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
 }
 
 /*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	}
 
 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+	 * Operation stats, however this will not be called after an entire
+	 * autovacuum cycle is done -- which will likely vacuum many relations --
+	 * or until the VACUUM command has processed all tables and committed.
+	 */
+	pgstat_flush_io_ops(false);
 }
 
 /*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
 	}
 
 	pgstat_unlock_entry(entry_ref);
+
+	/* see pgstat_report_vacuum() */
+	pgstat_flush_io_ops(false);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 9a4f037959..275a7be166 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
 		LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
 		LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
 		LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+		for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+			LWLockInitialize(&ctl->io_ops.stats[i].lock,
+							 LWTRANCHE_PGSTATS_DATA);
 	}
 	else
 	{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
 
 /*
  * Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
  *
  * Must be called by processes that generate WAL, that do not call
  * pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
 pgstat_report_wal(bool force)
 {
 	pgstat_flush_wal(force);
+
+	pgstat_flush_io_ops(force);
 }
 
 /*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 96bffc0f2a..b783af130c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2084,6 +2084,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
 		pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
 		pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
 	}
+	else if (strcmp(target, "io") == 0)
+		pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
 	else if (strcmp(target, "recovery_prefetch") == 0)
 		XLogPrefetchResetStats();
 	else if (strcmp(target, "wal") == 0)
@@ -2092,7 +2094,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
 
 	PG_RETURN_VOID();
 }
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e7ebea4ff4..bf97162e83 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
 	B_WAL_WRITER,
 } BackendType;
 
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
 extern PGDLLIMPORT BackendType MyBackendType;
 
 extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5883aafe9c..010dc7267b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
 	PGSTAT_KIND_ARCHIVER,
 	PGSTAT_KIND_BGWRITER,
 	PGSTAT_KIND_CHECKPOINTER,
+	PGSTAT_KIND_IOOPS,
 	PGSTAT_KIND_SLRU,
 	PGSTAT_KIND_WAL,
 } PgStat_Kind;
@@ -326,6 +327,12 @@ typedef struct PgStat_IOContextOps
 	PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
 } PgStat_IOContextOps;
 
+typedef struct PgStat_BackendIOContextOps
+{
+	TimestampTz stat_reset_timestamp;
+	PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
 typedef struct PgStat_StatDBEntry
 {
 	PgStat_Counter n_xact_commit;
@@ -508,6 +515,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
  */
 
 extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
 extern const char *pgstat_io_context_desc(IOContext io_context);
 extern const char *pgstat_io_op_desc(IOOp io_op);
 
@@ -519,6 +527,86 @@ extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp i
 
 /* IO stats translation function in freelist.c */
 extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+	Assert(counters->evictions == 0 && counters->extends == 0 &&
+			counters->fsyncs == 0 && counters->reads == 0 &&
+			counters->rejections == 0 && counters->repossessions == 0 &&
+			counters->reuses == 0 && counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			Assert(counters->evictions == 0);
+			return;
+		case IOOP_EXTEND:
+			Assert(counters->extends == 0);
+			return;
+		case IOOP_FSYNC:
+			Assert(counters->fsyncs == 0);
+			return;
+		case IOOP_READ:
+			Assert(counters->reads == 0);
+			return;
+		case IOOP_REJECT:
+			Assert(counters->rejections == 0);
+			return;
+		case IOOP_REPOSSESS:
+			Assert(counters->repossessions == 0);
+			return;
+		case IOOP_REUSE:
+			Assert(counters->reuses == 0);
+			return;
+		case IOOP_WRITE:
+			Assert(counters->writes == 0);
+			return;
+	}
+
+	/* Should not reach here */
+	Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp that is not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+										   backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+	bool		expect_backend_stats = true;
+
+	if (!pgstat_io_op_stats_collected(bktype))
+		expect_backend_stats = false;
+
+	for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+	{
+		if (!expect_backend_stats ||
+			!pgstat_bktype_io_context_valid(bktype, (IOContext) io_context))
+		{
+			pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+			continue;
+		}
+
+		for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+		{
+			if (!pgstat_io_op_valid(bktype, (IOContext) io_context, (IOOp) io_op))
+				pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+						(IOOp) io_op);
+		}
+	}
+}
 
 
 /*
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 627c1389e4..9066fed660 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -330,6 +330,25 @@ typedef struct PgStatShared_Checkpointer
 	PgStat_CheckpointerStats reset_offset;
 } PgStatShared_Checkpointer;
 
+typedef struct PgStatShared_IOContextOps
+{
+	/*
+	 * lock protects ->data. If this PgStatShared_IOContextOps is
+	 * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+	 * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+	 */
+	LWLock		lock;
+	PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+	/* ->stat_reset_timestamp is protected by ->stats[0].lock */
+	TimestampTz stat_reset_timestamp;
+	PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
 typedef struct PgStatShared_SLRU
 {
 	/* lock protects ->stats */
@@ -420,6 +439,7 @@ typedef struct PgStat_ShmemControl
 	PgStatShared_Archiver archiver;
 	PgStatShared_BgWriter bgwriter;
 	PgStatShared_Checkpointer checkpointer;
+	PgStatShared_BackendIOContextOps io_ops;
 	PgStatShared_SLRU slru;
 	PgStatShared_Wal wal;
 } PgStat_ShmemControl;
@@ -443,6 +463,8 @@ typedef struct PgStat_Snapshot
 
 	PgStat_CheckpointerStats checkpointer;
 
+	PgStat_BackendIOContextOps io_ops;
+
 	PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
 
 	PgStat_WalStats wal;
@@ -550,6 +572,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
 extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
 
 
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
 /*
  * Functions in pgstat_relation.c
  */
@@ -642,6 +673,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
 
 extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
 
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
 
 /*
  * Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b080367073..6d33b2c9bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,14 @@ PgFdwRelationInfo
 PgFdwScanState
 PgIfAddrCallback
 PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
 PgStatShared_BgWriter
 PgStatShared_Checkpointer
 PgStatShared_Common
 PgStatShared_Database
 PgStatShared_Function
 PgStatShared_HashEntry
+PgStatShared_IOContextOps
 PgStatShared_Relation
 PgStatShared_ReplSlot
 PgStatShared_SLRU
@@ -2018,6 +2020,7 @@ PgStatShared_Subscription
 PgStatShared_Wal
 PgStat_ArchiverStats
 PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
 PgStat_BackendSubEntry
 PgStat_BgWriterStats
 PgStat_CheckpointerStats
-- 
2.34.1

From ac9fa4fe501fd948cfc4c4af983555813bfb20de Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 6 Oct 2022 12:23:25 -0400
Subject: [PATCH v36 2/5] Track IO operation statistics locally

Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the IO source, target, or type done by a backend. For example, the
checkpointer may write a shared buffer out. This would be counted as an
IOOp "written" on an IOContext IOCONTEXT_SHARED by BackendType
"checkpointer".

Each IOOp (evict, reject, repossess, reuse, read, write, extend, and
fsync) is counted per IOContext (bulkread, bulkwrite, local, shared, or
vacuum) through a call to pgstat_count_io_op().

The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.

The IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers, respectively.

The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.

IOOP_FREELIST_ACQUIRE and IOOP_EVICT IOOps are counted in
IOCONTEXT_SHARED and IOCONTEXT_LOCAL IOContexts when a buffer is
acquired or allocated through [Local]BufferAlloc() and no
BufferAccessStrategy is in use.

When a BufferAccessStrategy is in use, shared buffers added to the
strategy ring are counted as IOOP_FREELIST_ACQUIRE or IOOP_EVICT IOOps
in the IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of
these buffers is reused, it is counted as an IOOP_REUSE IOOp in the
corresponding strategy IOContext.

IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.

Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.

Author: Melanie Plageman <melanieplage...@gmail.com>
Reviewed-by: Andres Freund <and...@anarazel.de>
Reviewed-by: Justin Pryzby <pry...@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota....@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakre...@gmail.com>
Reviewed-by: Lukas Fittl <lu...@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 src/backend/postmaster/checkpointer.c      |  13 ++
 src/backend/storage/buffer/bufmgr.c        |  82 ++++++-
 src/backend/storage/buffer/freelist.c      |  51 ++++-
 src/backend/storage/buffer/localbuf.c      |   6 +
 src/backend/storage/sync/sync.c            |   2 +
 src/backend/utils/activity/Makefile        |   1 +
 src/backend/utils/activity/meson.build     |   1 +
 src/backend/utils/activity/pgstat_io_ops.c | 255 +++++++++++++++++++++
 src/include/pgstat.h                       |  68 ++++++
 src/include/storage/buf_internals.h        |   2 +-
 src/include/storage/bufmgr.h               |   7 +-
 src/tools/pgindent/typedefs.list           |   4 +
 12 files changed, 479 insertions(+), 13 deletions(-)
 create mode 100644 src/backend/utils/activity/pgstat_io_ops.c

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..4ea4e6a298 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 		if (!AmBackgroundWriterProcess())
 			CheckpointerShmem->num_backend_fsync++;
 		LWLockRelease(CheckpointerCommLock);
+
+		/*
+		 * We have no way of knowing if the current IOContext is
+		 * IOCONTEXT_SHARED or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+		 * point, so count the fsync as being in the IOCONTEXT_SHARED
+		 * IOContext. This is probably okay, because the number of backend
+		 * fsyncs doesn't say anything about the efficacy of the
+		 * BufferAccessStrategy. And counting both fsyncs done in
+		 * IOCONTEXT_SHARED and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+		 * IOCONTEXT_SHARED is likely clearer when investigating the number of
+		 * backend fsyncs.
+		 */
+		pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
 		return false;
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e7b0b31bb..1cc108004f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
 							   BlockNumber blockNum,
 							   BufferAccessStrategy strategy,
 							   bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *bufHdr;
 	Block		bufBlock;
 	bool		found;
+	IOContext	io_context;
 	bool		isExtend;
 	bool		isLocalBuf = SmgrIsTemp(smgr);
 
@@ -833,6 +834,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	isExtend = (blockNum == P_NEW);
 
+	if (isLocalBuf)
+		io_context = IOCONTEXT_LOCAL;
+	else
+		io_context = IOContextForStrategy(strategy);
+
 	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
 									   smgr->smgr_rlocator.locator.spcOid,
 									   smgr->smgr_rlocator.locator.dbOid,
@@ -990,6 +996,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		pgstat_count_io_op(IOOP_EXTEND, io_context);
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
 		/* don't set checksum for all-zero page */
@@ -1015,6 +1022,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			instr_time	io_start,
 						io_time;
 
+			pgstat_count_io_op(IOOP_READ, io_context);
+
+
 			if (track_io_timing)
 				INSTR_TIME_SET_CURRENT(io_start);
 
@@ -1121,6 +1131,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			BufferAccessStrategy strategy,
 			bool *foundPtr)
 {
+	bool		from_ring;
+	IOContext	io_context;
 	BufferTag	newTag;			/* identity of requested block */
 	uint32		newHash;		/* hash value for newTag */
 	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
@@ -1187,9 +1199,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 */
 	LWLockRelease(newPartitionLock);
 
+	io_context = IOContextForStrategy(strategy);
+
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
+
 		/*
 		 * Ensure, while the spinlock's not yet held, that there's a free
 		 * refcount entry.
@@ -1200,7 +1215,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * Select a victim buffer.  The buffer is returned with its header
 		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &buf_state);
+		buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
 
 		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
 
@@ -1263,13 +1278,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 					}
 				}
 
+				/*
+				 * When a strategy is in use, only flushes of dirty buffers
+				 * already in the strategy ring are counted as strategy writes
+				 * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+				 * purpose of IO operation statistics tracking.
+				 *
+				 * If a shared buffer initially added to the ring must be
+				 * flushed before being used, this is counted as an
+				 * IOCONTEXT_SHARED IOOP_WRITE.
+				 *
+				 * If a shared buffer that was added to the ring later (because
+				 * the current strategy buffer was pinned or in use, or because
+				 * all strategy buffers were dirty and rejected, which can only
+				 * happen for BAS_BULKREAD) requires flushing, this is counted
+				 * as an IOCONTEXT_SHARED IOOP_WRITE (from_ring will be false).
+				 *
+				 * When a strategy is not in use, the write can only be a
+				 * "regular" write of a dirty shared buffer (IOCONTEXT_SHARED
+				 * IOOP_WRITE).
+				 */
+
 				/* OK, do the I/O */
 				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
 														  smgr->smgr_rlocator.locator.spcOid,
 														  smgr->smgr_rlocator.locator.dbOid,
 														  smgr->smgr_rlocator.locator.relNumber);
 
-				FlushBuffer(buf, NULL);
+				FlushBuffer(buf, NULL, io_context);
 				LWLockRelease(BufferDescriptorGetContentLock(buf));
 
 				ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1477,30 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
+	if (oldFlags & BM_VALID)
+	{
+		/*
+		 * When a BufferAccessStrategy is in use, evictions adding a
+		 * shared buffer to the strategy ring are counted in the
+		 * corresponding strategy's context. This includes the evictions
+		 * done to add buffers to the ring initially as well as those
+		 * done to add a new shared buffer to the ring when the current
+		 * buffer is pinned or otherwise in use.
+		 *
+		 * Blocks evicted from buffers already in the strategy ring are counted
+		 * as IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, or IOCONTEXT_VACUUM
+		 * reuses.
+		 *
+		 * We wait until this point to count reuses and evictions in order to
+		 * avoid incorrectly counting a buffer as reused or evicted when it
+		 * was released because it was concurrently pinned or in use, or
+		 * counting it as reused when it was rejected or when we errored out.
+		 */
+		IOOp io_op = from_ring ? IOOP_REUSE : IOOP_EVICT;
+
+		pgstat_count_io_op(io_op, io_context);
+	}
+
 	if (oldPartitionLock != NULL)
 	{
 		BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2630,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	PinBuffer_Locked(bufHdr);
 	LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
 
-	FlushBuffer(bufHdr, NULL);
+	FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
 
 	LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 
@@ -2820,7 +2880,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  * as the second parameter.  If not, pass NULL.
  */
 static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
@@ -2900,6 +2960,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
+	pgstat_count_io_op(IOOP_WRITE, io_context);
+
 	if (track_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
 
@@ -3551,6 +3613,8 @@ FlushRelationBuffers(Relation rel)
 						  localpage,
 						  false);
 
+				pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
@@ -3586,7 +3650,7 @@ FlushRelationBuffers(Relation rel)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, RelationGetSmgr(rel));
+			FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr);
 		}
@@ -3684,7 +3748,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, srelent->srel);
+			FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr);
 		}
@@ -3894,7 +3958,7 @@ FlushDatabaseBuffers(Oid dbid)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, NULL);
+			FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr);
 		}
@@ -3921,7 +3985,7 @@ FlushOneBuffer(Buffer buffer)
 
 	Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
 
-	FlushBuffer(bufHdr, NULL);
+	FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 64728bd7ce..6eb2e00ae2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
@@ -192,13 +193,15 @@ have_free_buffer(void)
  *	return the buffer with the buffer header spinlock still held.
  */
 BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
 {
 	BufferDesc *buf;
 	int			bgwprocno;
 	int			trycounter;
 	uint32		local_buf_state;	/* to avoid repeated (de-)referencing */
 
+	*from_ring = false;
+
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
 	 * assume strategy objects don't need buffer_strategy_lock.
@@ -207,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 	{
 		buf = GetBufferFromRing(strategy, buf_state);
 		if (buf != NULL)
+		{
+			*from_ring = true;
 			return buf;
+		}
 	}
 
 	/*
@@ -299,6 +305,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 				if (strategy != NULL)
 					AddBufferToRing(strategy, buf);
 				*buf_state = local_buf_state;
+
 				return buf;
 			}
 			UnlockBufHdr(buf, local_buf_state);
@@ -331,6 +338,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 				if (strategy != NULL)
 					AddBufferToRing(strategy, buf);
 				*buf_state = local_buf_state;
+
 				return buf;
 			}
 		}
@@ -596,7 +604,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
 
 /*
  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- *		ring is empty.
+ *		ring is empty / not usable.
  *
  * The bufhdr spin lock is held on the returned buffer.
  */
@@ -643,7 +651,13 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	/*
 	 * Tell caller to allocate a new buffer with the normal allocation
 	 * strategy.  He'll then replace this ring element via AddBufferToRing.
+	 *
+	 * This counts as a "repossession" for the purposes of IO operation
+	 * statistics tracking, since the reason that we no longer consider the
+	 * current buffer to be part of the ring is that the block in it is in use
+	 * outside of the ring, preventing us from reusing the buffer.
 	 */
+	pgstat_count_io_op(IOOP_REPOSSESS, IOContextForStrategy(strategy));
 	return NULL;
 }
 
@@ -659,6 +673,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
 	strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
 }
 
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+	if (!strategy)
+		return IOCONTEXT_SHARED;
+
+	switch (strategy->btype)
+	{
+		case BAS_NORMAL:
+			/*
+			 * Currently, GetAccessStrategy() returns NULL for
+			 * BufferAccessStrategyType BAS_NORMAL, so this case is
+			 * unreachable.
+			 */
+			pg_unreachable();
+			return IOCONTEXT_SHARED;
+		case BAS_BULKREAD:
+			return IOCONTEXT_BULKREAD;
+		case BAS_BULKWRITE:
+			return IOCONTEXT_BULKWRITE;
+		case BAS_VACUUM:
+			return IOCONTEXT_VACUUM;
+	}
+
+	elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+}
+
 /*
  * StrategyRejectBuffer -- consider rejecting a dirty buffer
  *
@@ -688,5 +733,7 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 	 */
 	strategy->buffers[strategy->current] = InvalidBuffer;
 
+	pgstat_count_io_op(IOOP_REJECT, IOContextForStrategy(strategy));
+
 	return true;
 }
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..cb9685564f 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
 #include "access/parallel.h"
 #include "catalog/catalog.h"
 #include "executor/instrument.h"
+#include "pgstat.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "utils/guc_hooks.h"
@@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 				LocalRefCount[b]++;
 				ResourceOwnerRememberBuffer(CurrentResourceOwner,
 											BufferDescriptorGetBuffer(bufHdr));
+
 				break;
 			}
 		}
@@ -226,6 +228,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 				  localpage,
 				  false);
 
+		pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
 		/* Mark not-dirty now in case we error out below */
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -256,6 +260,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		ClearBufferTag(&bufHdr->tag);
 		buf_state &= ~(BM_VALID | BM_TAG_VALID);
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		pgstat_count_io_op(IOOP_EVICT, IOCONTEXT_LOCAL);
 	}
 
 	hresult = (LocalBufferLookupEnt *)
@@ -275,6 +280,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
 	*foundPtr = false;
+
 	return bufHdr;
 }
 
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
 					total_elapsed += elapsed;
 					processed++;
 
+					pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
 					if (log_checkpoints)
 						elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
 							 processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	pgstat_checkpointer.o \
 	pgstat_database.o \
 	pgstat_function.o \
+	pgstat_io_ops.o \
 	pgstat_relation.o \
 	pgstat_replslot.o \
 	pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
   'pgstat_checkpointer.c',
   'pgstat_database.c',
   'pgstat_function.c',
+  'pgstat_io_ops.c',
   'pgstat_relation.c',
   'pgstat_replslot.c',
   'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..6f9c250907
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,255 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ *	  Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+	PgStat_IOOpCounters *pending_counters;
+
+	Assert(io_context < IOCONTEXT_NUM_TYPES);
+	Assert(io_op < IOOP_NUM_TYPES);
+	Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+	pending_counters = &pending_IOOpStats.data[io_context];
+
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			pending_counters->evictions++;
+			break;
+		case IOOP_EXTEND:
+			pending_counters->extends++;
+			break;
+		case IOOP_FSYNC:
+			pending_counters->fsyncs++;
+			break;
+		case IOOP_READ:
+			pending_counters->reads++;
+			break;
+		case IOOP_REPOSSESS:
+			pending_counters->repossessions++;
+			break;
+		case IOOP_REJECT:
+			pending_counters->rejections++;
+			break;
+		case IOOP_REUSE:
+			pending_counters->reuses++;
+			break;
+		case IOOP_WRITE:
+			pending_counters->writes++;
+			break;
+	}
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+	switch (io_context)
+	{
+		case IOCONTEXT_BULKREAD:
+			return "bulkread";
+		case IOCONTEXT_BULKWRITE:
+			return "bulkwrite";
+		case IOCONTEXT_LOCAL:
+			return "local";
+		case IOCONTEXT_SHARED:
+			return "shared";
+		case IOCONTEXT_VACUUM:
+			return "vacuum";
+	}
+
+	elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			return "evicted";
+		case IOOP_EXTEND:
+			return "extended";
+		case IOOP_FSYNC:
+			return "files synced";
+		case IOOP_READ:
+			return "read";
+		case IOOP_REPOSSESS:
+			return "repossessed";
+		case IOOP_REJECT:
+			return "rejected";
+		case IOOP_REUSE:
+			return "reused";
+		case IOOP_WRITE:
+			return "written";
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+*   specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+	return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+		bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+	bool		no_local;
+
+	/*
+	 * In core Postgres, only regular backends and WAL Sender processes
+	 * executing queries should use local buffers. Parallel workers will not
+	 * use local buffers (see InitLocalBuffers()); however, extensions
+	 * leveraging background workers have no such limitation, so track IO
+	 * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+	 */
+	no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+		== B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+		B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+	if (io_context == IOCONTEXT_LOCAL && no_local)
+		return false;
+
+	/*
+	 * Some BackendTypes do not currently perform any IO operations in certain
+	 * IOContexts, and, while it may not be inherently incorrect for them to
+	 * do so, excluding those rows from the view makes the view easier to use.
+	 */
+	if ((io_context == IOCONTEXT_BULKREAD || io_context == IOCONTEXT_BULKWRITE
+		 || io_context == IOCONTEXT_VACUUM) && (bktype == B_CHECKPOINTER
+												|| bktype == B_BG_WRITER))
+		return false;
+
+	if (io_context == IOCONTEXT_VACUUM && bktype == B_AUTOVAC_LAUNCHER)
+		return false;
+
+	if (io_context == IOCONTEXT_BULKWRITE && (bktype == B_AUTOVAC_WORKER ||
+											  bktype == B_AUTOVAC_LAUNCHER))
+		return false;
+
+	return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+	bool		strategy_io_context;
+
+	/*
+	 * Some BackendTypes should never track IO Operation statistics.
+	 */
+	Assert(pgstat_io_op_stats_collected(bktype));
+
+	/*
+	 * Some BackendTypes will not do certain IOOps.
+	 */
+	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+		(io_op == IOOP_READ || io_op == IOOP_EVICT))
+		return false;
+
+	if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+		 B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+		return false;
+
+	/*
+	 * Some IOOps are not valid in certain IOContexts and some IOOps are only
+	 * valid in certain contexts.
+	 */
+	if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+		return false;
+
+	/*
+	 * Only BAS_BULKREAD will reject strategy buffers
+	 */
+	if (io_context != IOCONTEXT_BULKREAD && io_op == IOOP_REJECT)
+		return false;
+
+
+	strategy_io_context = io_context == IOCONTEXT_BULKREAD || io_context ==
+		IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+	/*
+	 * IOOP_REJECT, IOOP_REPOSSESS and IOOP_REUSE are only relevant when a
+	 * BufferAccessStrategy is in use.
+	 */
+	if (!strategy_io_context && (io_op == IOOP_REJECT || io_op ==
+				IOOP_REPOSSESS || io_op == IOOP_REUSE))
+		return false;
+
+	/*
+	 * Temporary tables using local buffers are not logged and thus do not
+	 * require fsync'ing.
+	 *
+	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+	 * counted in the IOCONTEXT_SHARED IOContext. See comment in
+	 * ForwardSyncRequest() for more details.
+	 */
+	if ((io_context == IOCONTEXT_LOCAL || strategy_io_context) &&
+		io_op == IOOP_FSYNC)
+		return false;
+
+	return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+	if (!pgstat_io_op_stats_collected(bktype))
+		return false;
+
+	if (!pgstat_bktype_io_context_valid(bktype, io_context))
+		return false;
+
+	if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+		return false;
+
+	return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e2ce6f011..5883aafe9c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
+#include "storage/buf.h"
 #include "utils/backend_progress.h" /* for backward compatibility */
 #include "utils/backend_status.h"	/* for backward compatibility */
 #include "utils/relcache.h"
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
 	PgStat_Counter buf_fsync_backend;
 } PgStat_CheckpointerStats;
 
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ * When adding a new value, ensure that the proper assertions are added to
+ * pgstat_io_context_ops_assert_zero() and pgstat_io_op_assert_zero() (though
+ * the compiler will remind you about the latter).
+ */
+
+typedef enum IOOp
+{
+	IOOP_EVICT = 0,
+	IOOP_EXTEND,
+	IOOP_FSYNC,
+	IOOP_READ,
+	IOOP_REJECT,
+	IOOP_REPOSSESS,
+	IOOP_REUSE,
+	IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+	IOCONTEXT_BULKREAD = 0,
+	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_LOCAL,
+	IOCONTEXT_SHARED,
+	IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+	PgStat_Counter evictions;
+	PgStat_Counter extends;
+	PgStat_Counter fsyncs;
+	PgStat_Counter reads;
+	PgStat_Counter rejections;
+	PgStat_Counter reuses;
+	PgStat_Counter repossessions;
+	PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+	PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
 typedef struct PgStat_StatDBEntry
 {
 	PgStat_Counter n_xact_commit;
@@ -453,6 +503,24 @@ extern void pgstat_report_checkpointer(void);
 extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 
 
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
 /*
  * Functions in pgstat_database.c
  */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b75481450d..7b67250747 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
 
 /* freelist.c */
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-									 uint32 *buf_state);
+									 uint32 *buf_state, bool *from_ring);
 extern void StrategyFreeBuffer(BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 								 BufferDesc *buf, bool from_ring);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa0960..d0eed71f63 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
 
 typedef void *Block;
 
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
 typedef enum BufferAccessStrategyType
 {
 	BAS_NORMAL,					/* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f02cc8f42..b080367073 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
 INFIX
 INT128
 INTERFACE_INFO
+IOContext
 IOFuncSelector
+IOOp
 IPCompareMethod
 ITEM
 IV
@@ -2026,6 +2028,8 @@ PgStat_FetchConsistency
 PgStat_FunctionCallUsage
 PgStat_FunctionCounts
 PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
 PgStat_Kind
 PgStat_KindInfo
 PgStat_LocalState
-- 
2.34.1

From 2bb6195640ec5f04dad43e276b4f2801bd5b76ab Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 6 Oct 2022 12:24:42 -0400
Subject: [PATCH v36 4/5] Add system view tracking IO ops per backend type

Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, rejections, repossessions, reads, writes, extends,
and fsyncs) done through each IOContext (shared buffers, local buffers,
and buffers reserved by a BufferAccessStrategy) by each type of backend
(e.g. client backend, checkpointer).

Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.

Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.

Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsynced so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.

Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.

View stats are populated from counters incremented when a backend performs
an IO operation and are maintained by the cumulative statistics subsystem.

Each row of the view shows stats for a particular BackendType and IOContext
combination (e.g. shared buffer accesses by checkpointer) and each column in
the view is the total number of IO Operations done (e.g. writes). So a cell
in the view would be, for example, the number of shared buffers written by
checkpointer since the last stats reset.

In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the "read",
"written", and "extended" columns for a given row.

Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend); however, these have been kept
in pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
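
As a rough cross-check (assuming neither the 'bgwriter' nor the 'io'
stats have been reset separately), buffers_backend in pg_stat_bgwriter
should correspond to something like:

    SELECT sum(written) + sum(extended) AS buffers_backend
    FROM pg_stat_io
    WHERE backend_type IN ('client backend', 'autovacuum worker',
                           'background worker', 'walsender')
      AND io_context IN ('shared', 'bulkread', 'bulkwrite', 'vacuum');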

Suggested by Andres Freund

Author: Melanie Plageman <melanieplage...@gmail.com>
Reviewed-by: Andres Freund <and...@anarazel.de>
Reviewed-by: Justin Pryzby <pry...@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota....@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakre...@gmail.com>
Reviewed-by: Lukas Fittl <lu...@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml         | 384 +++++++++++++++++++++++++--
 src/backend/catalog/system_views.sql |  16 ++
 src/backend/utils/adt/pgstatfuncs.c  | 139 ++++++++++
 src/include/catalog/pg_proc.dat      |   9 +
 src/test/regress/expected/rules.out  |  13 +
 src/test/regress/expected/stats.out  | 224 ++++++++++++++++
 src/test/regress/sql/stats.sql       | 123 +++++++++
 7 files changed, 892 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 698f274341..de0850337b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+      <entry>A row for each IO Context for each backend type showing
+      statistics about backend IO operations. See
+       <link linkend="monitoring-pg-stat-io-view">
+       <structname>pg_stat_io</structname></link> for details.
+     </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
       <entry>One row only, showing statistics about WAL activity. See
@@ -658,20 +667,20 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The <structname>pg_statio_</structname> views are primarily useful to
-   determine the effectiveness of the buffer cache.  When the number
-   of actual disk reads is much smaller than the number of buffer
-   hits, then the cache is satisfying most read requests without
-   invoking a kernel call. However, these statistics do not give the
-   entire story: due to the way in which <productname>PostgreSQL</productname>
-   handles disk I/O, data that is not in the
-   <productname>PostgreSQL</productname> buffer cache might still reside in the
-   kernel's I/O cache, and might therefore still be fetched without
-   requiring a physical read. Users interested in obtaining more
-   detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics views
-   in combination with operating system utilities that allow insight
-   into the kernel's handling of I/O.
+   The <structname>pg_statio_</structname> and
+   <structname>pg_stat_io</structname> views are primarily useful to determine
+   the effectiveness of the buffer cache.  When the number of actual disk reads
+   is much smaller than the number of buffer hits, then the cache is satisfying
+   most read requests without invoking a kernel call. However, these statistics
+   do not give the entire story: due to the way in which
+   <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+   the <productname>PostgreSQL</productname> buffer cache might still reside in
+   the kernel's I/O cache, and might therefore still be fetched without
+   requiring a physical read. Users interested in obtaining more detailed
+   information on <productname>PostgreSQL</productname> I/O behavior are
+   advised to use the <productname>PostgreSQL</productname> statistics views in
+   combination with operating system utilities that allow insight into the
+   kernel's handling of I/O.
   </para>
 
  </sect2>
@@ -3600,13 +3609,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
       </para>
       <para>
-       Time at which these statistics were last reset
+       Time at which these statistics were last reset.
       </para></entry>
      </row>
     </tbody>
    </tgroup>
   </table>
-
   <para>
     Normally, WAL files are archived in order, oldest to newest, but that is
     not guaranteed, and does not hold under special circumstances like when
@@ -3615,7 +3623,351 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     <structfield>last_archived_wal</structfield> have also been successfully
     archived.
   </para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+  <title><structname>pg_stat_io</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_io</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_stat_io</structname> view has a row for each backend type
+   and IO context containing global data for the cluster on IO operations done
+   by that backend type in that IO context. Currently only a subset of IO
+   operations are tracked here. WAL IO, IO on temporary files, and some forms
+   of IO outside of shared buffers (such as when building indexes or moving a
+   table from one tablespace to another) could be added in the future.
+  </para>
+
+  <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+   <title><structname>pg_stat_io</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>backend_type</structfield> <type>text</type>
+      </para>
+      <para>
+       Type of backend (e.g. background worker, autovacuum worker).
+       See <link linkend="monitoring-pg-stat-activity-view">
+       <structname>pg_stat_activity</structname></link> for more information on
+       <varname>backend_type</varname>s.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>io_context</structfield> <type>text</type>
+      </para>
+      <para>
+       The context or location of an IO operation.
+       <varname>io_context</varname> <literal>shared</literal> refers to IO
+       operations of data in shared buffers, the primary buffer pool for
+       relation data. <varname>io_context</varname> <literal>local</literal>
+       refers to IO operations on process-local memory used for temporary
+       tables. <varname>io_context</varname> <literal>vacuum</literal> refers
+       to the IO operations incurred while vacuuming and analyzing.
+       <varname>io_context</varname> <literal>bulkread</literal> refers to IO
+       operations specially designated as <literal>bulk reads</literal>, such
+       as the sequential scan of a large table. <varname>io_context</varname>
+       <literal>bulkwrite</literal> refers to IO operations specially
+       designated as <literal>bulk writes</literal>, such as
+       <command>COPY</command>.
+       </para>
+
+       <para>
+       These last three <varname>io_context</varname>s are counted separately
+       because the autovacuum daemon, explicit <command>VACUUM</command>,
+       explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+       writes use a fixed amount of memory, acquiring the equivalent number of
+       shared buffers and reusing them circularly to avoid occupying an undue
+       portion of the main shared buffer pool. This pattern is called a
+       <quote>Buffer Access Strategy</quote> in the
+       <productname>PostgreSQL</productname> source code and the fixed-size
+       ring buffer is referred to as a <quote>strategy ring buffer</quote> for
+       the purposes of this view's documentation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>read</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Reads by this <varname>backend_type</varname> into buffers in this
+       <varname>io_context</varname>.
+       </para>
+       <para>
+       <varname>read</varname> and <varname>extended</varname> for
+       <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+       <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+       <literal>standalone backend</literal>, <literal>background
+       worker</literal>, and <literal>walsender</literal> for all
+       <varname>io_context</varname>s are similar to the sum of
+       <varname>heap_blks_read</varname>, <varname>idx_blks_read</varname>,
+       <varname>tidx_blks_read</varname>, and
+       <varname>toast_blks_read</varname> in <link
+       linkend="monitoring-pg-statio-all-tables-view">
+       <structname>pg_statio_all_tables</structname></link> and
+       <varname>blks_read</varname> from <link
+       linkend="monitoring-pg-stat-database-view">
+       <structname>pg_stat_database</structname></link>. The difference is that
+       reads done as part of <command>CREATE DATABASE</command> are not counted
+       in <structname>pg_statio_all_tables</structname> and
+       <structname>pg_stat_database</structname>.
+       </para>
+       <para>If using the <productname>PostgreSQL</productname> extension,
+       <xref linkend="pgstatstatements"/>,
+       <varname>read</varname> for
+       <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+       <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+       <literal>standalone backend</literal>, <literal>background
+       worker</literal>, and <literal>walsender</literal> for all
+       <varname>io_context</varname>s is equivalent to
+       <varname>shared_blks_read</varname> together with
+       <varname>local_blks_read</varname>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>written</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Writes of data in this <varname>io_context</varname> written out by this
+       <varname>backend_type</varname>.
+       </para>
+
+       <para>
+       Normal client backends should be able to rely on maintenance processes
+       like the checkpointer and background writer to write out dirty data as
+       much as possible. Large numbers of writes by
+       <varname>backend_type</varname> <literal>client backend</literal> in
+       <varname>io_context</varname> <literal>shared</literal> could indicate a
+       misconfiguration of shared buffers or of checkpointer. More information
+       on checkpointer configuration can be found in <xref
+       linkend="wal-configuration"/>.
+       </para>
+
+       <para>Note that the values of <varname>written</varname> for
+       <varname>backend_type</varname> <literal>background writer</literal> and
+       <varname>backend_type</varname> <literal>checkpointer</literal> are
+       equivalent to the values of <varname>buffers_clean</varname> and
+       <varname>buffers_checkpoint</varname>, respectively, in <link
+       linkend="monitoring-pg-stat-bgwriter-view">
+       <structname>pg_stat_bgwriter</structname></link>. Also, the sum of
+       <varname>written</varname> and <varname>extended</varname> in this view
+       for <varname>backend_type</varname>s <literal>client backend</literal>,
+       <literal>autovacuum worker</literal>, <literal>background
+       worker</literal>, and <literal>walsender</literal> in
+       <varname>io_context</varname>s <literal>shared</literal>,
+       <literal>bulkread</literal>, <literal>bulkwrite</literal>, and
+       <literal>vacuum</literal> is equivalent to
+       <varname>buffers_backend</varname> in
+       <structname>pg_stat_bgwriter</structname>.
+       </para>
+
+       <para>If using the <productname>PostgreSQL</productname> extension,
+       <xref linkend="pgstatstatements"/>, <varname>written</varname> and
+       <varname>extended</varname> for <varname>backend_type</varname>s
+       <literal>autovacuum launcher</literal>, <literal>autovacuum
+       worker</literal>, <literal>client backend</literal>, <literal>standalone
+       backend</literal>, <literal>background worker</literal>, and
+       <literal>walsender</literal> for all <varname>io_context</varname>s is
+       equivalent to <varname>shared_blks_written</varname> together with
+       <varname>local_blks_written</varname>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>extended</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Extends of relations done by this <varname>backend_type</varname> in
+       order to write data in this <varname>io_context</varname>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>bytes_conversion</structfield> <type>bigint</type>
+      </para>
+      <para>
+      The number of bytes per unit of IO read, written, or extended. For
+      block-oriented IO of relation data, reads, writes, and extends are done
+      in <varname>block_size</varname> units, derived from the build-time
+      parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+      default. Future values could include those derived from
+      <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
+      constant multipliers once non-block-oriented IO (e.g. temporary file IO)
+      is tracked here.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>evicted</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times a <varname>backend_type</varname> has evicted a block
+       from a shared or local buffer in order to reuse the buffer in this
+       <varname>io_context</varname>. Blocks are only evicted when there are no
+       unoccupied buffers.
+       </para>
+
+      <para>
+       <varname>evicted</varname> in <varname>io_context</varname>
+       <literal>shared</literal> counts the number of times a block from a
+       shared buffer was evicted so that it can be replaced with another block,
+       also in shared buffers.
+
+       A high <varname>evicted</varname> count in <varname>io_context</varname>
+       <literal>shared</literal> could indicate that shared buffers is too
+       small and should be set to a larger value.
+       </para>
+
+       <para>
+       <varname>evicted</varname> in <varname>io_context</varname>
+       <literal>vacuum</literal>, <literal>bulkread</literal>, and
+       <literal>bulkwrite</literal> counts the number of times occupied shared
+       buffers were added to the fixed-size strategy ring buffer, causing the
+       buffer contents to be evicted. If the current buffer in the ring is
+       pinned or in use by another backend, it may be replaced by a new shared
+       buffer. If this shared buffer contains valid data, that block must be
+       evicted and will count as <varname>evicted</varname>.
+
+       In <varname>io_context</varname> <literal>bulkread</literal>, existing
+       dirty buffers in the ring requiring flush are
+       <varname>rejected</varname>. If all of the buffers in the strategy ring
+       have been <varname>rejected</varname>, a new shared buffer will be added
+       to the ring. If the new shared buffer is occupied, its contents will
+       need to be evicted.
+
+       Seeing a large number of <varname>evicted</varname> in strategy
+       <varname>io_context</varname>s can provide insight into primary working
+       set cache misses.
+       </para>
+
+       <para>
+       <varname>evicted</varname> in <varname>io_context</varname>
+       <literal>local</literal> counts the number of times a block of data from
+       an existing local buffer was evicted in order to replace it with another
+       block, also in local buffers.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>reused</structfield> <type>bigint</type>
+      </para>
+      <para>
+       The number of times an existing buffer in the strategy ring was reused
+       as part of an operation in the <literal>bulkread</literal>,
+       <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+       <varname>io_context</varname>s. When a <quote>Buffer Access
+       Strategy</quote> reuses a buffer in the strategy ring, it must evict its
+       contents, incrementing <varname>reused</varname>. When a <quote>Buffer
+       Access Strategy</quote> adds a new shared buffer to the strategy ring
+       and this shared buffer is occupied, the <quote>Buffer Access
+       Strategy</quote> must evict the contents of the shared buffer,
+       incrementing <varname>evicted</varname>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>rejected</structfield> <type>bigint</type>
+      </para>
+      <para>
+      The number of times a <literal>bulkread</literal> found the current
+      buffer in the fixed-size strategy ring dirty and requiring flush.
+      <quote>Rejecting</quote> the buffer effectively removes it from the
+      strategy ring buffer allowing the slot in the ring to be replaced in the
+      future with a new shared buffer. A high number of
+      <literal>bulkread</literal> rejections can indicate a need for more
+      frequent vacuuming or more aggressive autovacuum settings, as buffers are
+      dirtied during a bulkread operation when updating hint bits or when
+      performing on-access pruning.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>repossessed</structfield> <type>bigint</type>
+      </para>
+      <para>
+       The number of times a buffer in the fixed-size ring buffer used by
+       operations in the <literal>bulkread</literal>,
+       <literal>bulkwrite</literal>, and <literal>vacuum</literal>
+       <varname>io_context</varname>s was removed from that ring buffer because
+       it was pinned or in use by another backend and so the block it
+       contained could not be evicted to allow reuse. Once removed from the
+       strategy ring, this buffer is a <quote>normal</quote> shared buffer
+       again. A high number of repossessions is a sign of contention for the
+       blocks operated on by the strategy operation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>files_synced</structfield> <type>bigint</type>
+      </para>
+       <para>
+       Number of files fsynced by this <varname>backend_type</varname> for the
+       purpose of persisting data dirtied in this
+       <varname>io_context</varname>. <literal>fsyncs</literal> are done at
+       segment boundaries so <varname>bytes_conversion</varname> does not apply to the
+       <varname>files_synced</varname> column. <literal>fsyncs</literal> done
+       by backends in order to persist data written in
+       <varname>io_context</varname> <literal>vacuum</literal>,
+       <varname>io_context</varname> <literal>bulkread</literal>, or
+       <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+       as an <varname>io_context</varname> <literal>shared</literal>
+       <literal>fsync</literal>.
+       </para>
 
+       <para>
+       Normal client backends should be able to rely on the checkpointer to
+       ensure data is persisted to permanent storage. Large numbers of
+       <varname>files_synced</varname> by <varname>backend_type</varname>
+       <literal>client backend</literal> could indicate a misconfiguration of
+       shared buffers or of checkpointer. More information on checkpointer
+       configuration can be found in <xref linkend="wal-configuration"/>.
+       </para>
+
+       <para>
+       Note that the sum of <varname>files_synced</varname> in
+       <varname>io_context</varname> <literal>shared</literal> across all
+       <varname>backend_type</varname>s except <literal>checkpointer</literal>
+       is equivalent to <varname>buffers_backend_fsync</varname> in
+       <link linkend="monitoring-pg-stat-bgwriter-view"> <structname>pg_stat_bgwriter</structname></link>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+      </para>
+      <para>
+       Time at which these statistics were last reset.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
  </sect2>
 
  <sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..571c422f73 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,22 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_io AS
+SELECT
+       b.backend_type,
+       b.io_context,
+       b.read,
+       b.written,
+       b.extended,
+       b.bytes_conversion,
+       b.evicted,
+       b.reused,
+       b.rejected,
+       b.repossessed,
+       b.files_synced,
+       b.stats_reset
+FROM pg_stat_get_io() b;
+
 CREATE VIEW pg_stat_wal AS
     SELECT
         w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b783af130c..5bd39733b6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -29,6 +29,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/guc.h"
 #include "utils/inet.h"
 #include "utils/timestamp.h"
 
@@ -1725,6 +1726,144 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+	IO_COL_BACKEND_TYPE,
+	IO_COL_IO_CONTEXT,
+	IO_COL_READS,
+	IO_COL_WRITES,
+	IO_COL_EXTENDS,
+	IO_COL_CONVERSION,
+	IO_COL_EVICTIONS,
+	IO_COL_REUSES,
+	IO_COL_REJECTIONS,
+	IO_COL_REPOSSESSIONS,
+	IO_COL_FSYNCS,
+	IO_COL_RESET_TIME,
+	IO_NUM_COLUMNS,
+}			io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			return IO_COL_EVICTIONS;
+		case IOOP_READ:
+			return IO_COL_READS;
+		case IOOP_REUSE:
+			return IO_COL_REUSES;
+		case IOOP_REJECT:
+			return IO_COL_REJECTIONS;
+		case IOOP_REPOSSESS:
+			return IO_COL_REPOSSESSIONS;
+		case IOOP_WRITE:
+			return IO_COL_WRITES;
+		case IOOP_EXTEND:
+			return IO_COL_EXTENDS;
+		case IOOP_FSYNC:
+			return IO_COL_FSYNCS;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+	PgStat_BackendIOContextOps *backends_io_stats;
+	ReturnSetInfo *rsinfo;
+	Datum		reset_time;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+	reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+		bool		expect_backend_stats = true;
+		PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
+															bktype);
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+			const char *io_context_str = pgstat_io_context_desc(io_context);
+
+			Datum		values[IO_NUM_COLUMNS] = {0};
+			bool		nulls[IO_NUM_COLUMNS] = {0};
+
+			/*
+			 * Some combinations of IOContext and BackendType are not valid
+			 * for any type of IOOp. In such cases, omit the entire row from
+			 * the view.
+			 */
+			if (!expect_backend_stats ||
+				!pgstat_bktype_io_context_valid((BackendType) bktype,
+												(IOContext) io_context))
+			{
+				pgstat_io_context_ops_assert_zero(counters);
+				continue;
+			}
+
+			values[IO_COL_BACKEND_TYPE] = bktype_desc;
+			values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+			values[IO_COL_READS] = Int64GetDatum(counters->reads);
+			values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+			values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+			/*
+			 * Hard-code this to blocks until we have non-block-oriented IO
+			 * represented in the view as well
+			 */
+			values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+			values[IO_COL_EVICTIONS] = Int64GetDatum(counters->evictions);
+			values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+			values[IO_COL_REJECTIONS] = Int64GetDatum(counters->rejections);
+			values[IO_COL_REPOSSESSIONS] = Int64GetDatum(counters->repossessions);
+			values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+			values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+			/*
+			 * Some combinations of BackendType and IOOp and of IOContext and
+			 * IOOp are not valid. Set these cells in the view NULL and assert
+			 * that these stats are zero as expected.
+			 */
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				if (!(pgstat_io_op_valid((BackendType) bktype, (IOContext)
+										 io_context, (IOOp) io_op)))
+				{
+					pgstat_io_op_assert_zero(counters, (IOOp) io_op);
+					nulls[pgstat_io_op_get_index((IOOp) io_op)] = true;
+				}
+			}
+
+			tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+		}
+	}
+
+	return (Datum) 0;
+}
+
 /*
  * Returns statistics of WAL activity
  */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 20f5aa56ea..aae96db37a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5653,6 +5653,15 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+  proname => 'pg_stat_get_io', provolatile => 'v',
+  prorows => '30', proretset => 't',
+  proparallel => 'r', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{text,text,int8,int8,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_context,read,written,extended,bytes_conversion,evicted,reused,rejected,repossessed,files_synced,stats_reset}',
+  prosrc => 'pg_stat_get_io' },
+
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 624d0e5aae..c46babade3 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,19 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+    b.io_context,
+    b.read,
+    b.written,
+    b.extended,
+    b.bytes_conversion,
+    b.evicted,
+    b.reused,
+    b.rejected,
+    b.repossessed,
+    b.files_synced,
+    b.stats_reset
+   FROM pg_stat_get_io() b(backend_type, io_context, read, written, extended, bytes_conversion, evicted, reused, rejected, repossessed, files_synced, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 257a6a9da9..28ef9171de 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1120,4 +1120,228 @@ SELECT pg_stat_get_subscription_stats(NULL);
  
 (1 row)
 
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count 
+-------
+   100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) AS io_sum_local_evictions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count 
+-------
+  8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared 
+----------------------
+ 
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index f6270f7bad..75c2f6c4c0 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -535,4 +535,127 @@ SELECT pg_stat_get_replication_slot(NULL);
 SELECT pg_stat_get_subscription_stats(NULL);
 
 
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) AS io_sum_local_evictions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
 -- End of Stats Test
-- 
2.34.1

From 4746ef5834de99836f81be8ffd322d139c940a25 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 13 Oct 2022 11:03:05 -0700
Subject: [PATCH v36 1/5] Remove BufferAccessStrategyData->current_was_in_ring

It is a duplication of StrategyGetBuffer->from_ring.
---
 src/backend/storage/buffer/bufmgr.c   |  2 +-
 src/backend/storage/buffer/freelist.c | 15 ++-------------
 src/include/storage/buf_internals.h   |  2 +-
 3 files changed, 4 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..4e7b0b31bb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1254,7 +1254,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 					UnlockBufHdr(buf, buf_state);
 
 					if (XLogNeedsFlush(lsn) &&
-						StrategyRejectBuffer(strategy, buf))
+						StrategyRejectBuffer(strategy, buf, from_ring))
 					{
 						/* Drop lock/pin and loop around for another buffer */
 						LWLockRelease(BufferDescriptorGetContentLock(buf));
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..64728bd7ce 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,12 +81,6 @@ typedef struct BufferAccessStrategyData
 	 */
 	int			current;
 
-	/*
-	 * True if the buffer just returned by StrategyGetBuffer had been in the
-	 * ring already.
-	 */
-	bool		current_was_in_ring;
-
 	/*
 	 * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
 	 * have not yet selected a buffer for this ring slot.  For allocation
@@ -625,10 +619,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	 */
 	bufnum = strategy->buffers[strategy->current];
 	if (bufnum == InvalidBuffer)
-	{
-		strategy->current_was_in_ring = false;
 		return NULL;
-	}
 
 	/*
 	 * If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +635,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
 		&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
 	{
-		strategy->current_was_in_ring = true;
 		*buf_state = local_buf_state;
 		return buf;
 	}
@@ -654,7 +644,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	 * Tell caller to allocate a new buffer with the normal allocation
 	 * strategy.  He'll then replace this ring element via AddBufferToRing.
 	 */
-	strategy->current_was_in_ring = false;
 	return NULL;
 }
 
@@ -682,14 +671,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
  * if this buffer should be written and re-used.
  */
 bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
 
 	/* Don't muck with behavior of normal buffer-replacement strategy */
-	if (!strategy->current_was_in_ring ||
+	if (!from_ring ||
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..b75481450d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -395,7 +395,7 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state);
 extern void StrategyFreeBuffer(BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
-								 BufferDesc *buf);
+								 BufferDesc *buf, bool from_ring);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
 extern void StrategyNotifyBgWriter(int bgwprocno);
-- 
2.34.1
