On Sat, Mar 28, 2026 at 4:28 AM Tomas Vondra <[email protected]> wrote:
>
> On 3/26/26 05:21, Ashutosh Bapat wrote:
> > On Wed, Mar 25, 2026 at 10:19 PM Masahiko Sawada <[email protected]> 
> > wrote:
> >>
> >> On Tue, Mar 24, 2026 at 11:47 PM Lukas Fittl <[email protected]> wrote:
> >>>
> >>> Hi Ashutosh,
> >>>
> >>> On Tue, Mar 24, 2026 at 11:24 PM Ashutosh Bapat
> >>> <[email protected]> wrote:
> >>>> I know we already have a couple of hand-aggregation functions but I am
> >>>> hesitant to add more of these. Question is where do we stop? For
> >>>> example, the current function is useless if someone wants to find the
> >>>> parts of a relation which are hot since it doesn't include page
> >>>> numbers. Do we write another function for the same? Or we add page
> >>>> numbers to this function and then there's hardly any aggregation
> >>>> happening. What if somebody wanted to perform an aggregation more
> >>>> complex than just count() like average number of buffers per relation
> >>>> or distribution of relation buffers in the cache, do they write
> >>>> separate functions?
> >>>
> >>> I think the problem this solves, which is a very common question I
> >>> hear from end users, is "how much of this table/index is in cache" and
> >>> "was our query slow because the cache contents changed?".
> >>>
> >>> It can't provide a perfect answer to all questions regarding what's in
> >>> the cache (i.e. it won't tell you which part of the table is cached),
> >>> but it's in line with other statistics we already provide in
> >>> pg_stat_user_tables etc., which are all aggregate counts, not further
> >>> breakdowns.
> >>>
> >>> It's also a reasonable compromise on providing something usable that
> >>> can be shown on dashboards, as I've seen while collecting this
> >>> information using the existing methods on small production systems
> >>> in practice over the last ~1.5 years.
> >>
> >> Regarding the proposed statistics, I find them reasonably useful for
> >> many users. I'm not sure we need to draw a strict line on what belongs
> >> in the module. If a proposed function does exactly what most
> >> pg_buffercache users want or are already writing themselves, that is
> >> good enough motivation to include it.
> >>
> >> I think pg_visibility is a good precedent here. In that module, we
> >> have both pg_visibility_map() and pg_visibility_map_summary(), even
> >> though we can retrieve the exact same results as the latter by simply
> >> using the former:
> >>
> >> select sum(all_visible::int), sum(all_frozen::int) from
> >> pg_visibility_map('test') ;
> >>
> >
> > A summary may still be ok, but this proposal goes a bit further:
> > it's grouping by one subset, which should really be done with GROUP BY
> > in SQL. And I do
> >
> > I am afraid that at some point we will start finding all of these to
> > be a maintenance burden. At that point, removing them will become a
> > real pain for backward compatibility reasons. For example:
> > 1. The proposed function is going to add one more test to an already
> > huge testing exercise for shared buffers resizing.
> > 2. If we change the way we manage the buffer cache, e.g. use a
> > tree-based cache instead of hash + array, each function that
> > traverses the buffer cache array is going to add work - adjusting it
> > to the new data structure - and make a hard project even harder. In
> > this case we have other ways to get the summary, so the code-level
> > scan of the buffer cache is entirely avoidable.
> >
> > If I am the only one opposing it, and there are more senior
> > contributors in favour of adding this function, we can accept it.
> >
>
> I understand this argument - we have SQL, which allows us to process the
> data in a flexible way, without hard-coding all interesting groupings.
> The question is whether this particular grouping is special enough to
> warrant a custom *faster* function.
>
> The main argument here seems to be the performance, and the initial
> message demonstrates a 10x speedup (2ms vs. 20ms) on a cluster with
> 128MB shared buffers. Unless I misunderstood what config it uses.
>
> I gave it a try on an Azure VM with 32GB shared buffers, to make it a
> bit more realistic, and my timings are 10ms vs. 700ms. But I also wonder
> whether the original timings really were from a cluster with 128MB,
> because for me that shows 0.3ms vs. 3ms (so an order of magnitude faster
> than what was reported). But I suppose that's also hardware-specific.
>
> Nevertheless, it is much faster. I haven't profiled this but I assume
> it's thanks to not having to write the entries into a tuplestore (and
> possibly into a tempfile).

In parallel, Palak Chaturvedi and I developed a quick patch to
modernize pg_buffercache_pages() to use a tuplestore, so that it
doesn't have to rely on NBuffers staying the same between the start of
the scan, when the memory is allocated, and the end of the scan - a
condition that can be violated when resizing the buffer cache. It
improves the timings by about 10-30% on my laptop for a 128MB buffer
cache. Without this patch, the time taken to execute Lukas's query
varies between 10-15ms on my laptop; with the patch, it varies between
8-9ms, so the timing is also more stable as a side effect. It's not
the 10x improvement we are looking for, but it looks like a step in
the right direction. The improvement seems to come purely from
avoiding the creation of a full heap tuple. I wonder if there are
places further up in the execution tree where full heap tuples get
formed again instead of continuing to use minimal tuples, or where we
perform extra actions that are not required.

I didn't dig into the history to find out why pg_buffercache_pages()
wasn't modernized earlier, but I don't see any hazard in doing so.

Lukas's patch allocates the hash table entirely in memory, whereas a
tuplestore restricts its memory usage to work_mem. His function might
therefore use more memory than the user expects when the hash table
grows beyond work_mem.
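For context, the hand-written aggregation being discussed is roughly the per-relation summary query that the pg_buffercache documentation suggests; a sketch of what users run today (the exact join conditions and LIMIT are illustrative, not taken from Lukas's patch):

```sql
-- Rough sketch of the per-relation cache summary users typically
-- hand-roll over the pg_buffercache view today, which the proposed
-- function would compute in a single buffer-cache pass.
SELECT c.relname,
       count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```

Every row of pg_buffercache crosses the executor here, which is exactly the tuple-forming overhead the numbers above are about.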

-- 
Best Wishes,
Ashutosh Bapat
From 9112536d59124057f81b01c9e48feed31f0cc98d Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <[email protected]>
Date: Wed, 11 Mar 2026 15:57:02 +0530
Subject: [PATCH v20260328] pg_buffercache_pages() modernization and
 optimization

Many set-returning functions (SRFs) use InitMaterializedSRF() and a
tuplestore to store their result, but pg_buffercache_pages() uses its
own code to initialize the SRF state and does not use a tuplestore.
Because of the latter, it has to create a full heap tuple when that is
not necessary. A tuplestore, on the other hand, can store minimal
tuples, which saves the CPU cycles spent forming and deforming full
heap tuples. Modernize pg_buffercache_pages() to use the SRF
infrastructure and a tuplestore.

Author: Ashutosh Bapat <[email protected]>
Author: Palak Chaturvedi <[email protected]>
---
 contrib/pg_buffercache/pg_buffercache_pages.c | 257 ++++++------------
 1 file changed, 81 insertions(+), 176 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index db4d711cce7..fae531573cb 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -37,39 +37,6 @@ PG_MODULE_MAGIC_EXT(
 					.version = PG_VERSION
 );
 
-/*
- * Record structure holding the to be exposed cache data.
- */
-typedef struct
-{
-	uint32		bufferid;
-	RelFileNumber relfilenumber;
-	Oid			reltablespace;
-	Oid			reldatabase;
-	ForkNumber	forknum;
-	BlockNumber blocknum;
-	bool		isvalid;
-	bool		isdirty;
-	uint16		usagecount;
-
-	/*
-	 * An int32 is sufficiently large, as MAX_BACKENDS prevents a buffer from
-	 * being pinned by too many backends and each backend will only pin once
-	 * because of bufmgr.c's PrivateRefCount infrastructure.
-	 */
-	int32		pinning_backends;
-} BufferCachePagesRec;
-
-
-/*
- * Function context for data persisting over repeated calls.
- */
-typedef struct
-{
-	TupleDesc	tupdesc;
-	BufferCachePagesRec *record;
-} BufferCachePagesContext;
-
 /*
  * Record structure holding the to be exposed cache data for OS pages.  This
  * structure is used by pg_buffercache_os_pages(), where NUMA information may
@@ -117,142 +84,90 @@ static bool firstNumaTouch = true;
 Datum
 pg_buffercache_pages(PG_FUNCTION_ARGS)
 {
-	FuncCallContext *funcctx;
-	Datum		result;
-	MemoryContext oldcontext;
-	BufferCachePagesContext *fctx;	/* User function context. */
-	TupleDesc	tupledesc;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	expected_tupledesc;
-	HeapTuple	tuple;
-
-	if (SRF_IS_FIRSTCALL())
-	{
-		int			i;
-
-		funcctx = SRF_FIRSTCALL_INIT();
-
-		/* Switch context when allocating stuff to be used in later calls */
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
-
-		/* Create a user function context for cross-call persistence */
-		fctx = palloc_object(BufferCachePagesContext);
-
-		/*
-		 * To smoothly support upgrades from version 1.0 of this extension
-		 * transparently handle the (non-)existence of the pinning_backends
-		 * column. We unfortunately have to get the result type for that... -
-		 * we can't use the result type determined by the function definition
-		 * without potentially crashing when somebody uses the old (or even
-		 * wrong) function definition though.
-		 */
-		if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
-			elog(ERROR, "return type must be a row type");
-
-		if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
-			expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
-			elog(ERROR, "incorrect number of output arguments");
-
-		/* Construct a tuple descriptor for the result rows. */
-		tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
-						   INT4OID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
-						   OIDOID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
-						   OIDOID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
-						   OIDOID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
-						   INT2OID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
-						   INT8OID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
-						   BOOLOID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
-						   INT2OID, -1, 0);
-
-		if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
-			TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
-							   INT4OID, -1, 0);
-
-		TupleDescFinalize(tupledesc);
-		fctx->tupdesc = BlessTupleDesc(tupledesc);
-
-		/* Allocate NBuffers worth of BufferCachePagesRec records. */
-		fctx->record = (BufferCachePagesRec *)
-			MemoryContextAllocHuge(CurrentMemoryContext,
-								   sizeof(BufferCachePagesRec) * NBuffers);
-
-		/* Set max calls and remember the user function context. */
-		funcctx->max_calls = NBuffers;
-		funcctx->user_fctx = fctx;
+	Datum		values[NUM_BUFFERCACHE_PAGES_ELEM];
+	bool		nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+	int			i;
 
-		/* Return to original context when allocating transient memory */
-		MemoryContextSwitchTo(oldcontext);
-
-		/*
-		 * Scan through all the buffers, saving the relevant fields in the
-		 * fctx->record structure.
-		 *
-		 * We don't hold the partition locks, so we don't get a consistent
-		 * snapshot across all buffers, but we do grab the buffer header
-		 * locks, so the information of each buffer is self-consistent.
-		 */
-		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *bufHdr;
-			uint64		buf_state;
+	/*
+	 * To smoothly support upgrades from version 1.0 of this extension
+	 * transparently handle the (non-)existence of the pinning_backends
+	 * column. We unfortunately have to get the result type for that... - we
+	 * can't use the result type determined by the function definition without
+	 * potentially crashing when somebody uses the old (or even wrong)
+	 * function definition though.
+	 */
+	if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
 
-			CHECK_FOR_INTERRUPTS();
+	if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+		expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+		elog(ERROR, "incorrect number of output arguments");
 
-			bufHdr = GetBufferDescriptor(i);
-			/* Lock each buffer header before inspecting. */
-			buf_state = LockBufHdr(bufHdr);
+	InitMaterializedSRF(fcinfo, 0);
 
-			fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
-			fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
-			fctx->record[i].reltablespace = bufHdr->tag.spcOid;
-			fctx->record[i].reldatabase = bufHdr->tag.dbOid;
-			fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
-			fctx->record[i].blocknum = bufHdr->tag.blockNum;
-			fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
-			fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+	/*
+	 * Scan through all the buffers, adding one row for each of the buffers to
+	 * the tuplestore.
+	 *
+	 * We don't hold the partition locks, so we don't get a consistent
+	 * snapshot across all buffers, but we do grab the buffer header locks, so
+	 * the information of each buffer is self-consistent.
+	 */
+	for (i = 0; i < NBuffers; i++)
+	{
+		BufferDesc *bufHdr;
+		uint64		buf_state;
+		uint32		bufferid;
+		RelFileNumber relfilenumber;
+		Oid			reltablespace;
+		Oid			reldatabase;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		bool		isvalid;
+		bool		isdirty;
+		uint16		usagecount;
+		int32		pinning_backends;
 
-			if (buf_state & BM_DIRTY)
-				fctx->record[i].isdirty = true;
-			else
-				fctx->record[i].isdirty = false;
+		CHECK_FOR_INTERRUPTS();
 
-			/* Note if the buffer is valid, and has storage created */
-			if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
-				fctx->record[i].isvalid = true;
-			else
-				fctx->record[i].isvalid = false;
+		bufHdr = GetBufferDescriptor(i);
+		/* Lock each buffer header before inspecting. */
+		buf_state = LockBufHdr(bufHdr);
+
+		bufferid = BufferDescriptorGetBuffer(bufHdr);
+		relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+		reltablespace = bufHdr->tag.spcOid;
+		reldatabase = bufHdr->tag.dbOid;
+		forknum = BufTagGetForkNum(&bufHdr->tag);
+		blocknum = bufHdr->tag.blockNum;
+		usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+		pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
 
-			UnlockBufHdr(bufHdr);
-		}
-	}
+		if (buf_state & BM_DIRTY)
+			isdirty = true;
+		else
+			isdirty = false;
 
-	funcctx = SRF_PERCALL_SETUP();
+		/* Note if the buffer is valid, and has storage created */
+		if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+			isvalid = true;
+		else
+			isvalid = false;
 
-	/* Get the saved state */
-	fctx = funcctx->user_fctx;
+		UnlockBufHdr(bufHdr);
 
-	if (funcctx->call_cntr < funcctx->max_calls)
-	{
-		uint32		i = funcctx->call_cntr;
-		Datum		values[NUM_BUFFERCACHE_PAGES_ELEM];
-		bool		nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+		/* Build the tuple and add it to tuplestore */
+		memset(nulls, 0, sizeof(nulls));
 
-		values[0] = Int32GetDatum(fctx->record[i].bufferid);
-		nulls[0] = false;
+		values[0] = Int32GetDatum(bufferid);
 
 		/*
 		 * Set all fields except the bufferid to null if the buffer is unused
 		 * or not valid.
 		 */
-		if (fctx->record[i].blocknum == InvalidBlockNumber ||
-			fctx->record[i].isvalid == false)
+		if (blocknum == InvalidBlockNumber || isvalid == false)
 		{
 			nulls[1] = true;
 			nulls[2] = true;
@@ -262,37 +177,27 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 			nulls[6] = true;
 			nulls[7] = true;
 			/* unused for v1.0 callers, but the array is always long enough */
-			nulls[8] = true;
+			if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+				nulls[8] = true;
 		}
 		else
 		{
-			values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
-			nulls[1] = false;
-			values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
-			nulls[2] = false;
-			values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
-			nulls[3] = false;
-			values[4] = Int16GetDatum(fctx->record[i].forknum);
-			nulls[4] = false;
-			values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
-			nulls[5] = false;
-			values[6] = BoolGetDatum(fctx->record[i].isdirty);
-			nulls[6] = false;
-			values[7] = UInt16GetDatum(fctx->record[i].usagecount);
-			nulls[7] = false;
+			values[1] = ObjectIdGetDatum(relfilenumber);
+			values[2] = ObjectIdGetDatum(reltablespace);
+			values[3] = ObjectIdGetDatum(reldatabase);
+			values[4] = Int16GetDatum(forknum);
+			values[5] = Int64GetDatum((int64) blocknum);
+			values[6] = BoolGetDatum(isdirty);
+			values[7] = UInt16GetDatum(usagecount);
 			/* unused for v1.0 callers, but the array is always long enough */
-			values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
-			nulls[8] = false;
+			if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+				values[8] = Int32GetDatum(pinning_backends);
 		}
 
-		/* Build and return the tuple. */
-		tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
-		result = HeapTupleGetDatum(tuple);
-
-		SRF_RETURN_NEXT(funcctx, result);
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 	}
-	else
-		SRF_RETURN_DONE(funcctx);
+
+	return (Datum) 0;
 }
 
 /*

base-commit: 999dec9ec6a81668057427c2e9312b20635fba02
-- 
2.34.1
