Hi, On Thu, Nov 20, 2025 at 04:59:07PM +0000, Bertrand Drouvot wrote: > On Wed, Nov 19, 2025 at 10:49:49PM +0900, Michael Paquier wrote: > > > > Hmm. I can think about an option 3 here: pg_buffercache outlines the > > view pg_buffercache_numa as the primary choice over > > pg_buffercache_numa_pages(). So I would suggest a more drastic > > strategy, that should not break monitoring queries with the views > > being the primary source for the results: > > - Rename of pg_buffercache_numa_pages() to pg_buffercache_os_pages(), > > that takes in input a boolean argument to decide if numa should be > > executed or not. > > - Creation of a second view for the OS pages that calls > > pg_buffercache_os_pages() without the numa code activated, for the two > > attributes that matter. > > - Switch the existing view pg_buffercache_numa to call > > pg_buffercache_os_pages() with the numa code activated. If NUMA > > cannot be set up, elog(ERROR). > > Love the idea: the new view would not suffer from the numa availability > overhead > and the current behavior is kept. Will look at it, thanks!
Here they are: 0001: Is nothing but the same as the one shared in [1]. 0002: Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries It's not really needed anymore since we'll avoid code duplication with the new approach. That said I think they help for code readability so keeping them (I don't have a strong opinion about it if other prefer not to add them). 0003: Adding pg_buffercache_numa_pages_internal() This new function makes NUMA data collection conditional. It extracts the core current pg_buffercache_numa_pages() logic into an internal function that accepts a boolean parameter. It's currently only called with the boolean set to true to serve the pg_buffercache_numa view needs. It's done that way to ease to review but could be pushed as is. 0004: Add pg_buffercache_os_pages function and view The patch: - renames pg_buffercache_numa_pages_internal() to pg_buffercache_os_pages() - keep pg_buffercache_numa_pages() as a backward compatibility wrapper - re-create the pg_buffercache_numa view on top of pg_buffercache_os_pages using true as argument - adds doc - adds test Remark for the doc: the patch does not show the pg_buffercache_os_pages() parameter. It just mentions that it exists. I think that's fine given that a) the same is true for pg_buffercache_evict() and pg_buffercache_evict_relation() (maybe that should be changed though), b) the only purpose of this function is to be linked to the pg_buffercache_os_pages and pg_buffercache_numa views. [1]: https://www.postgresql.org/message-id/aSBOKX6pLJzumbmF%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
>From 75b4d5cd73c4dd16d762007007bf2986367af133 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Fri, 21 Nov 2025 05:53:50 +0000 Subject: [PATCH v8 1/4] Remove unused fields from BufferCacheNumaRec These were added by ba2a3c2302f and never been used. Also move (and re-word a bit) a comment that was not at the right place. --- contrib/pg_buffercache/pg_buffercache_pages.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) 100.0% contrib/pg_buffercache/ diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index c29b784dfa1..1fe350783b1 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -82,9 +82,6 @@ typedef struct typedef struct { TupleDesc tupdesc; - int buffers_per_page; - int pages_per_buffer; - int os_page_size; BufferCacheNumaRec *record; } BufferCacheNumaContext; @@ -368,7 +365,11 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) os_page_ptrs = palloc0(sizeof(void *) * os_page_count); os_page_status = palloc(sizeof(uint64) * os_page_count); - /* Fill pointers for all the memory pages. */ + /* + * Fill pointers for all the memory pages. This loop stores into + * os_page_ptrs[] and touches (if needed) addresses as input to one + * big move_pages(2) inquiry system call. + */ idx = 0; for (char *ptr = startptr; ptr < endptr; ptr += os_page_size) { @@ -449,10 +450,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) * We don't hold the partition locks, so we don't get a consistent * snapshot across all buffers, but we do grab the buffer header * locks, so the information of each buffer is self-consistent. - * - * This loop touches and stores addresses into os_page_ptrs[] as input - * to one big move_pages(2) inquiry system call. Basically we ask for - * all memory pages for NBuffers. */ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1)); idx = 0; -- 2.34.1
>From d4d8c25ac42d7a98c3b919327ebd4b02f75fd059 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Fri, 21 Nov 2025 04:52:08 +0000 Subject: [PATCH v8 2/4] Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries Those new macro and function are extracted from pg_buffercache_numa_pages() and help for code readability. --- contrib/pg_buffercache/pg_buffercache_pages.c | 51 ++++++++++++++----- 1 file changed, 38 insertions(+), 13 deletions(-) 100.0% contrib/pg_buffercache/ diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 1fe350783b1..0e062f88a20 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -28,6 +28,12 @@ #define NUM_BUFFERCACHE_NUMA_ELEM 3 +/* + * Get the maximum buffer cache entries needed. + */ +#define GET_MAX_BUFFER_ENTRIES(nbuffers, os_page_size) \ + ((nbuffers) * (Max(1, BLCKSZ / (os_page_size)) + 1)) + PG_MODULE_MAGIC_EXT( .name = "pg_buffercache", .version = PG_VERSION @@ -102,6 +108,34 @@ PG_FUNCTION_INFO_V1(pg_buffercache_evict_all); /* Only need to touch memory once per backend process lifetime */ static bool firstNumaTouch = true; +/* + * Helper function to get buffer page boundaries. + * + * Given a buffer pointer and OS page size, calculates the start/end + * pointers and first page number. + */ +static void +get_buffer_page_boundaries(char *buffptr, Size os_page_size, char *startptr, + char **startptr_buff, char **endptr_buff, + int32 *page_num) +{ + char *start_ptr; + char *end_ptr; + + /* start of the first page of this buffer */ + start_ptr = (char *) TYPEALIGN_DOWN(os_page_size, buffptr); + + /* end of the buffer (no need to align to memory page) */ + end_ptr = buffptr + BLCKSZ; + + Assert(start_ptr < end_ptr); + + /* calculate ID of the first page for this buffer */ + *page_num = (start_ptr - startptr) / os_page_size; + *startptr_buff = start_ptr; + *endptr_buff = end_ptr; +} + Datum pg_buffercache_pages(PG_FUNCTION_ARGS) @@ -317,7 +351,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) void **os_page_ptrs; int *os_page_status; uint64 os_page_count; - int pages_per_buffer; int max_entries; char *startptr, *endptr; @@ -429,8 +462,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) * number as we're walking buffers. That way we can do it in one pass, * without reallocating memory. */ - pages_per_buffer = Max(1, BLCKSZ / os_page_size) + 1; - max_entries = NBuffers * pages_per_buffer; + max_entries = GET_MAX_BUFFER_ENTRIES(NBuffers, os_page_size); /* Allocate entries for BufferCachePagesRec records. */ fctx->record = (BufferCacheNumaRec *) @@ -471,16 +503,9 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) bufferid = BufferDescriptorGetBuffer(bufHdr); UnlockBufHdr(bufHdr); - /* start of the first page of this buffer */ - startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr); - - /* end of the buffer (no need to align to memory page) */ - endptr_buff = buffptr + BLCKSZ; - - Assert(startptr_buff < endptr_buff); - - /* calculate ID of the first page for this buffer */ - page_num = (startptr_buff - startptr) / os_page_size; + /* Get page boundaries for this buffer. */ + get_buffer_page_boundaries(buffptr, os_page_size, startptr, + &startptr_buff, &endptr_buff, &page_num); /* Add an entry for each OS page overlapping with this buffer. */ for (char *ptr = startptr_buff; ptr < endptr_buff; ptr += os_page_size) -- 2.34.1
>From 3f9a9d476b68e58f9b8e09f0de8347f07b24ccb2 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Fri, 21 Nov 2025 09:10:07 +0000 Subject: [PATCH v8 3/4] Adding pg_buffercache_numa_pages_internal() This new function makes NUMA data collection conditional. It extracts the core current pg_buffercache_numa_pages() logic into an internal function that accepts a boolean parameter. It's currently only called with the boolean set to true to serve the pg_buffercache_numa view needs. It will also be called with false in a following commit to serve a new pg_buffercache_os_pages view needs. --- contrib/pg_buffercache/pg_buffercache_pages.c | 148 +++++++++++------- 1 file changed, 89 insertions(+), 59 deletions(-) 100.0% contrib/pg_buffercache/ diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 0e062f88a20..cc02f7ea9f5 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -88,6 +88,7 @@ typedef struct typedef struct { TupleDesc tupdesc; + bool include_numa; BufferCacheNumaRec *record; } BufferCacheNumaContext; @@ -318,22 +319,26 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) } /* - * Inquire about NUMA memory mappings for shared buffers. + * Internal function to inquire about OS pages mappings for shared buffers, + * with optional NUMA information. * - * Returns NUMA node ID for each memory page used by the buffer. Buffers may - * be smaller or larger than OS memory pages. For each buffer we return one - * entry for each memory page used by the buffer (if the buffer is smaller, - * it only uses a part of one memory page). + * When 'include_numa' is: + * - false: Returns buffer to OS page mappings quickly, with numa_node as NULL. + * - true: Initializes NUMA and returns numa_node values. + * + * Buffers may be smaller or larger than OS memory pages. For each buffer we + * return one entry for each memory page used by the buffer (if the buffer is + * smaller, it only uses a part of one memory page). * * We expect both sizes (for buffers and memory pages) to be a power-of-2, so * one is always a multiple of the other. * - * In order to get reliable results we also need to touch memory pages, so - * that the inquiry about NUMA memory node doesn't return -2 (which indicates - * unmapped/unallocated pages). + * When 'include_numa' is true, in order to get reliable results we also need + * to touch memory pages, so that the inquiry about NUMA memory node doesn't + * return -2 (which indicates unmapped/unallocated pages). */ -Datum -pg_buffercache_numa_pages(PG_FUNCTION_ARGS) +static Datum +pg_buffercache_numa_pages_internal(PG_FUNCTION_ARGS, bool include_numa) { FuncCallContext *funcctx; MemoryContext oldcontext; @@ -348,14 +353,14 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) int i, idx; Size os_page_size; - void **os_page_ptrs; - int *os_page_status; - uint64 os_page_count; + int *os_page_status = NULL; + uint64 os_page_count = 0; int max_entries; char *startptr, *endptr; - if (pg_numa_init() == -1) + /* If NUMA information is requested, initialize NUMA support. */ + if (include_numa && pg_numa_init() == -1) elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform"); /* @@ -383,50 +388,55 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) */ Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0)); - /* - * How many addresses we are going to query? Simply get the page for - * the first buffer, and first page after the last buffer, and count - * the pages from that. - */ - startptr = (char *) TYPEALIGN_DOWN(os_page_size, - BufferGetBlock(1)); - endptr = (char *) TYPEALIGN(os_page_size, - (char *) BufferGetBlock(NBuffers) + BLCKSZ); - os_page_count = (endptr - startptr) / os_page_size; - - /* Used to determine the NUMA node for all OS pages at once */ - os_page_ptrs = palloc0(sizeof(void *) * os_page_count); - os_page_status = palloc(sizeof(uint64) * os_page_count); - - /* - * Fill pointers for all the memory pages. This loop stores into - * os_page_ptrs[] and touches (if needed) addresses as input to one - * big move_pages(2) inquiry system call. - */ - idx = 0; - for (char *ptr = startptr; ptr < endptr; ptr += os_page_size) + if (include_numa) { - os_page_ptrs[idx++] = ptr; + void **os_page_ptrs = NULL; + + /* + * How many addresses we are going to query? Simply get the page + * for the first buffer, and first page after the last buffer, and + * count the pages from that. + */ + startptr = (char *) TYPEALIGN_DOWN(os_page_size, + BufferGetBlock(1)); + endptr = (char *) TYPEALIGN(os_page_size, + (char *) BufferGetBlock(NBuffers) + BLCKSZ); + os_page_count = (endptr - startptr) / os_page_size; + + /* Used to determine the NUMA node for all OS pages at once */ + os_page_ptrs = palloc0(sizeof(void *) * os_page_count); + os_page_status = palloc(sizeof(uint64) * os_page_count); + + /* + * Fill pointers for all the memory pages. This loop stores into + * os_page_ptrs[] and touches (if needed) addresses as input to + * one big move_pages(2) inquiry system call. + */ + idx = 0; + for (char *ptr = startptr; ptr < endptr; ptr += os_page_size) + { + os_page_ptrs[idx++] = ptr; - /* Only need to touch memory once per backend process lifetime */ - if (firstNumaTouch) - pg_numa_touch_mem_if_required(ptr); - } + /* Only need to touch memory once per backend process lifetime */ + if (firstNumaTouch) + pg_numa_touch_mem_if_required(ptr); + } - Assert(idx == os_page_count); + Assert(idx == os_page_count); - elog(DEBUG1, "NUMA: NBuffers=%d os_page_count=" UINT64_FORMAT " " - "os_page_size=%zu", NBuffers, os_page_count, os_page_size); + elog(DEBUG1, "NUMA: NBuffers=%d os_page_count=" UINT64_FORMAT " " + "os_page_size=%zu", NBuffers, os_page_count, os_page_size); - /* - * If we ever get 0xff back from kernel inquiry, then we probably have - * bug in our buffers to OS page mapping code here. - */ - memset(os_page_status, 0xff, sizeof(int) * os_page_count); + /* + * If we ever get 0xff back from kernel inquiry, then we probably + * have bug in our buffers to OS page mapping code here. + */ + memset(os_page_status, 0xff, sizeof(int) * os_page_count); - /* Query NUMA status for all the pointers */ - if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1) - elog(ERROR, "failed NUMA pages inquiry: %m"); + /* Query NUMA status for all the pointers */ + if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1) + elog(ERROR, "failed NUMA pages inquiry: %m"); + } /* Initialize the multi-call context, load entries about buffers */ @@ -454,6 +464,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) INT4OID, -1, 0); fctx->tupdesc = BlessTupleDesc(tupledesc); + fctx->include_numa = include_numa; /* * Each buffer needs at least one entry, but it might be offset in @@ -472,7 +483,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) /* Return to original context when allocating transient memory */ MemoryContextSwitchTo(oldcontext); - if (firstNumaTouch) + if (include_numa && firstNumaTouch) elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts"); /* @@ -512,7 +523,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) { fctx->record[idx].bufferid = bufferid; fctx->record[idx].page_num = page_num; - fctx->record[idx].numa_node = os_page_status[page_num]; + fctx->record[idx].numa_node = include_numa ? os_page_status[page_num] : -1; /* advance to the next entry/page */ ++idx; @@ -520,14 +531,18 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) } } - Assert((idx >= os_page_count) && (idx <= max_entries)); + Assert(idx <= max_entries); + + if (include_numa) + Assert(idx >= os_page_count); /* Set max calls and remember the user function context. */ funcctx->max_calls = idx; funcctx->user_fctx = fctx; - /* Remember this backend touched the pages */ - firstNumaTouch = false; + /* Remember this backend touched the pages (only relevant for NUMA) */ + if (include_numa) + firstNumaTouch = false; } funcctx = SRF_PERCALL_SETUP(); @@ -547,8 +562,16 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) values[1] = Int64GetDatum(fctx->record[i].page_num); nulls[1] = false; - values[2] = Int32GetDatum(fctx->record[i].numa_node); - nulls[2] = false; + if (fctx->include_numa) + { + values[2] = Int32GetDatum(fctx->record[i].numa_node); + nulls[2] = false; + } + else + { + values[2] = (Datum) 0; + nulls[2] = true; + } /* Build and return the tuple. */ tuple = heap_form_tuple(fctx->tupdesc, values, nulls); @@ -560,6 +583,13 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) SRF_RETURN_DONE(funcctx); } +/* Entry point for extension. */ +Datum +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) +{ + return pg_buffercache_numa_pages_internal(fcinfo, true); +} + Datum pg_buffercache_summary(PG_FUNCTION_ARGS) { -- 2.34.1
>From b9247d0080b6e93488578a732853b9fc00df11cf Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Fri, 21 Nov 2025 05:17:20 +0000 Subject: [PATCH v8 4/4] Add pg_buffercache_os_pages function and view ba2a3c2302f added a way to check if a buffer is spread across multiple pages. Adding those new function and view so that one does not need NUMA support enabled to get this information. To do so this commit: - renames pg_buffercache_numa_pages_internal() to pg_buffercache_os_pages() - keep pg_buffercache_numa_pages() as a backward compatibility wrapper - re-create the pg_buffercache_numa view on top of pg_buffercache_os_pages using true as argument That way we avoid code duplication and the pg_buffercache_os_pages view does not get performance penalties when NUMA is available. --- contrib/pg_buffercache/Makefile | 2 +- .../expected/pg_buffercache.out | 18 +++ contrib/pg_buffercache/meson.build | 1 + .../pg_buffercache--1.6--1.7.sql | 34 +++++ contrib/pg_buffercache/pg_buffercache.control | 2 +- contrib/pg_buffercache/pg_buffercache_pages.c | 59 ++++++--- contrib/pg_buffercache/sql/pg_buffercache.sql | 8 ++ doc/src/sgml/pgbuffercache.sgml | 120 ++++++++++++++++-- src/tools/pgindent/typedefs.list | 2 + 9 files changed, 217 insertions(+), 29 deletions(-) 4.4% contrib/pg_buffercache/expected/ 3.2% contrib/pg_buffercache/sql/ 38.7% contrib/pg_buffercache/ 53.0% doc/src/sgml/ diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile index 5f748543e2e..0e618f66aec 100644 --- a/contrib/pg_buffercache/Makefile +++ b/contrib/pg_buffercache/Makefile @@ -9,7 +9,7 @@ EXTENSION = pg_buffercache DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \ pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \ pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \ - pg_buffercache--1.5--1.6.sql + pg_buffercache--1.5--1.6.sql pg_buffercache--1.6--1.7.sql PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time" REGRESS = pg_buffercache pg_buffercache_numa diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out index 9a9216dc7b1..26c2d5f5710 100644 --- a/contrib/pg_buffercache/expected/pg_buffercache.out +++ b/contrib/pg_buffercache/expected/pg_buffercache.out @@ -8,6 +8,16 @@ from pg_buffercache; t (1 row) +-- For pg_buffercache_os_pages, we expect at least one entry for each buffer +select count(*) >= (select setting::bigint + from pg_settings + where name = 'shared_buffers') +from pg_buffercache_os_pages; + ?column? +---------- + t +(1 row) + select buffers_used + buffers_unused > 0, buffers_dirty <= buffers_used, buffers_pinned <= buffers_used @@ -28,6 +38,8 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0; SET ROLE pg_database_owner; SELECT * FROM pg_buffercache; ERROR: permission denied for view pg_buffercache +SELECT * FROM pg_buffercache_os_pages; +ERROR: permission denied for view pg_buffercache_os_pages SELECT * FROM pg_buffercache_pages() AS p (wrong int); ERROR: permission denied for function pg_buffercache_pages SELECT * FROM pg_buffercache_summary(); @@ -43,6 +55,12 @@ SELECT count(*) > 0 FROM pg_buffercache; t (1 row) +SELECT count(*) > 0 FROM pg_buffercache_os_pages; + ?column? +---------- + t +(1 row) + SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary(); ?column? ---------- diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build index 7cd039a1df9..7c31141881f 100644 --- a/contrib/pg_buffercache/meson.build +++ b/contrib/pg_buffercache/meson.build @@ -24,6 +24,7 @@ install_data( 'pg_buffercache--1.3--1.4.sql', 'pg_buffercache--1.4--1.5.sql', 'pg_buffercache--1.5--1.6.sql', + 'pg_buffercache--1.6--1.7.sql', 'pg_buffercache.control', kwargs: contrib_data_args, ) diff --git a/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql new file mode 100644 index 00000000000..3cfe723c0c2 --- /dev/null +++ b/contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql @@ -0,0 +1,34 @@ +/* contrib/pg_buffercache/pg_buffercache--1.6--1.7.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.7'" to load this file. \quit + +-- Register the new function with boolean parameter +-- This function is the core implementation for both OS pages and NUMA queries +CREATE FUNCTION pg_buffercache_os_pages(IN include_numa boolean, + OUT bufferid integer, + OUT os_page_num bigint, + OUT numa_node integer) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'pg_buffercache_os_pages' +LANGUAGE C PARALLEL SAFE; + +-- Create a view for convenient access. +CREATE VIEW pg_buffercache_os_pages AS + SELECT bufferid, os_page_num + FROM pg_buffercache_os_pages(false); + +DROP VIEW pg_buffercache_numa; + +-- Create a view for convenient access. +CREATE VIEW pg_buffercache_numa AS + SELECT bufferid, os_page_num, numa_node + FROM pg_buffercache_os_pages(true); + +REVOKE ALL ON FUNCTION pg_buffercache_os_pages(boolean) FROM PUBLIC; +REVOKE ALL ON pg_buffercache_os_pages FROM PUBLIC; +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC; + +GRANT EXECUTE ON FUNCTION pg_buffercache_os_pages(boolean) TO pg_monitor; +GRANT SELECT ON pg_buffercache_os_pages TO pg_monitor; +GRANT SELECT ON pg_buffercache_numa TO pg_monitor; diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control index b030ba3a6fa..11499550945 100644 --- a/contrib/pg_buffercache/pg_buffercache.control +++ b/contrib/pg_buffercache/pg_buffercache.control @@ -1,5 +1,5 @@ # pg_buffercache extension comment = 'examine the shared buffer cache' -default_version = '1.6' +default_version = '1.7' module_pathname = '$libdir/pg_buffercache' relocatable = true diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index cc02f7ea9f5..dbd4a601ccb 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -26,7 +26,7 @@ #define NUM_BUFFERCACHE_EVICT_RELATION_ELEM 3 #define NUM_BUFFERCACHE_EVICT_ALL_ELEM 3 -#define NUM_BUFFERCACHE_NUMA_ELEM 3 +#define NUM_BUFFERCACHE_OS_PAGES_ELEM 3 /* * Get the maximum buffer cache entries needed. @@ -73,14 +73,16 @@ typedef struct } BufferCachePagesContext; /* - * Record structure holding the to be exposed cache data. + * Record structure holding the to be exposed cache data for OS pages. + * This structure is used by pg_buffercache_os_pages() which takes a + * boolean parameter to control whether NUMA information is included. */ typedef struct { uint32 bufferid; int64 page_num; int32 numa_node; -} BufferCacheNumaRec; +} BufferCacheOsPagesRec; /* * Function context for data persisting over repeated calls. @@ -89,8 +91,8 @@ typedef struct { TupleDesc tupdesc; bool include_numa; - BufferCacheNumaRec *record; -} BufferCacheNumaContext; + BufferCacheOsPagesRec *record; +} BufferCacheOsPagesContext; /* @@ -98,6 +100,7 @@ typedef struct * relation node/tablespace/database/blocknum and dirty indicator. */ PG_FUNCTION_INFO_V1(pg_buffercache_pages); +PG_FUNCTION_INFO_V1(pg_buffercache_os_pages); PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages); PG_FUNCTION_INFO_V1(pg_buffercache_summary); PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts); @@ -319,8 +322,8 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) } /* - * Internal function to inquire about OS pages mappings for shared buffers, - * with optional NUMA information. + * Inquire about OS pages mappings for shared buffers, with optional NUMA + * information. * * When 'include_numa' is: * - false: Returns buffer to OS page mappings quickly, with numa_node as NULL. @@ -337,12 +340,12 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) * to touch memory pages, so that the inquiry about NUMA memory node doesn't * return -2 (which indicates unmapped/unallocated pages). */ -static Datum -pg_buffercache_numa_pages_internal(PG_FUNCTION_ARGS, bool include_numa) +Datum +pg_buffercache_os_pages(PG_FUNCTION_ARGS) { FuncCallContext *funcctx; MemoryContext oldcontext; - BufferCacheNumaContext *fctx; /* User function context. */ + BufferCacheOsPagesContext *fctx; /* User function context. */ TupleDesc tupledesc; TupleDesc expected_tupledesc; HeapTuple tuple; @@ -350,6 +353,7 @@ pg_buffercache_numa_pages_internal(PG_FUNCTION_ARGS, bool include_numa) if (SRF_IS_FIRSTCALL()) { + bool include_numa; int i, idx; Size os_page_size; @@ -359,6 +363,9 @@ pg_buffercache_numa_pages_internal(PG_FUNCTION_ARGS, bool include_numa) char *startptr, *endptr; + /* Get the boolean parameter that controls NUMA behavior */ + include_numa = PG_GETARG_BOOL(0); + /* If NUMA information is requested, initialize NUMA support. */ if (include_numa && pg_numa_init() == -1) elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform"); @@ -446,12 +453,12 @@ pg_buffercache_numa_pages_internal(PG_FUNCTION_ARGS, bool include_numa) oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); /* Create a user function context for cross-call persistence */ - fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext)); + fctx = (BufferCacheOsPagesContext *) palloc(sizeof(BufferCacheOsPagesContext)); if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE) elog(ERROR, "return type must be a row type"); - if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM) + if (expected_tupledesc->natts != NUM_BUFFERCACHE_OS_PAGES_ELEM) elog(ERROR, "incorrect number of output arguments"); /* Construct a tuple descriptor for the result rows. */ @@ -475,10 +482,10 @@ pg_buffercache_numa_pages_internal(PG_FUNCTION_ARGS, bool include_numa) */ max_entries = GET_MAX_BUFFER_ENTRIES(NBuffers, os_page_size); - /* Allocate entries for BufferCachePagesRec records. */ - fctx->record = (BufferCacheNumaRec *) + /* Allocate entries for BufferCacheOsPagesRec records. */ + fctx->record = (BufferCacheOsPagesRec *) MemoryContextAllocHuge(CurrentMemoryContext, - sizeof(BufferCacheNumaRec) * max_entries); + sizeof(BufferCacheOsPagesRec) * max_entries); /* Return to original context when allocating transient memory */ MemoryContextSwitchTo(oldcontext); @@ -553,8 +560,8 @@ pg_buffercache_numa_pages_internal(PG_FUNCTION_ARGS, bool include_numa) if (funcctx->call_cntr < funcctx->max_calls) { uint32 i = funcctx->call_cntr; - Datum values[NUM_BUFFERCACHE_NUMA_ELEM]; - bool nulls[NUM_BUFFERCACHE_NUMA_ELEM]; + Datum values[NUM_BUFFERCACHE_OS_PAGES_ELEM]; + bool nulls[NUM_BUFFERCACHE_OS_PAGES_ELEM]; values[0] = Int32GetDatum(fctx->record[i].bufferid); nulls[0] = false; @@ -583,11 +590,25 @@ pg_buffercache_numa_pages_internal(PG_FUNCTION_ARGS, bool include_numa) SRF_RETURN_DONE(funcctx); } -/* Entry point for extension. */ +/* Backward compatibility wrapper. */ Datum pg_buffercache_numa_pages(PG_FUNCTION_ARGS) { - return pg_buffercache_numa_pages_internal(fcinfo, true); + LOCAL_FCINFO(newfcinfo, 1); + Datum result; + + /* Initialize the new fcinfo structure. */ + InitFunctionCallInfoData(*newfcinfo, fcinfo->flinfo, 1, fcinfo->fncollation, + NULL, NULL); + + /* Set the include_numa parameter to true. */ + newfcinfo->args[0].value = BoolGetDatum(true); + newfcinfo->args[0].isnull = false; + + /* Call pg_buffercache_os_pages with include_numa=true */ + result = pg_buffercache_os_pages(newfcinfo); + + return result; } Datum diff --git a/contrib/pg_buffercache/sql/pg_buffercache.sql b/contrib/pg_buffercache/sql/pg_buffercache.sql index 47cca1907c7..3c70ee9ef4a 100644 --- a/contrib/pg_buffercache/sql/pg_buffercache.sql +++ b/contrib/pg_buffercache/sql/pg_buffercache.sql @@ -5,6 +5,12 @@ select count(*) = (select setting::bigint where name = 'shared_buffers') from pg_buffercache; +-- For pg_buffercache_os_pages, we expect at least one entry for each buffer +select count(*) >= (select setting::bigint + from pg_settings + where name = 'shared_buffers') +from pg_buffercache_os_pages; + select buffers_used + buffers_unused > 0, buffers_dirty <= buffers_used, buffers_pinned <= buffers_used @@ -16,6 +22,7 @@ SELECT count(*) > 0 FROM pg_buffercache_usage_counts() WHERE buffers >= 0; -- having to create a dedicated user, use the pg_database_owner pseudo-role. SET ROLE pg_database_owner; SELECT * FROM pg_buffercache; +SELECT * FROM pg_buffercache_os_pages; SELECT * FROM pg_buffercache_pages() AS p (wrong int); SELECT * FROM pg_buffercache_summary(); SELECT * FROM pg_buffercache_usage_counts(); @@ -24,6 +31,7 @@ RESET role; -- Check that pg_monitor is allowed to query view / function SET ROLE pg_monitor; SELECT count(*) > 0 FROM pg_buffercache; +SELECT count(*) > 0 FROM pg_buffercache_os_pages; SELECT buffers_used + buffers_unused > 0 FROM pg_buffercache_summary(); SELECT count(*) > 0 FROM pg_buffercache_usage_counts(); RESET role; diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml index 99ad2e68785..67f96e289b2 100644 --- a/doc/src/sgml/pgbuffercache.sgml +++ b/doc/src/sgml/pgbuffercache.sgml @@ -46,8 +46,9 @@ <para> This module provides the <function>pg_buffercache_pages()</function> function (wrapped in the <structname>pg_buffercache</structname> view), the - <function>pg_buffercache_numa_pages()</function> function (wrapped in the - <structname>pg_buffercache_numa</structname> view), the + <function>pg_buffercache_os_pages()</function> function (wrapped in the + <structname>pg_buffercache_os_pages</structname> and + <structname>pg_buffercache_numa</structname> views), the <function>pg_buffercache_summary()</function> function, the <function>pg_buffercache_usage_counts()</function> function, the <function>pg_buffercache_evict()</function> function, the @@ -63,12 +64,14 @@ </para> <para> - The <function>pg_buffercache_numa_pages()</function> function provides - <acronym>NUMA</acronym> node mappings for shared buffer entries. This - information is not part of <function>pg_buffercache_pages()</function> - itself, as it is much slower to retrieve. - The <structname>pg_buffercache_numa</structname> view wraps the function for - convenient use. + The <function>pg_buffercache_os_pages()</function> function provides OS + pages mappings for shared buffer entries. When its argument is <literal>true</literal>, + it also provides <acronym>NUMA</acronym> node mappings for shared buffer entries ( + this information is not part of <function>pg_buffercache_pages()</function> + itself, as it is much slower to retrieve). + The <structname>pg_buffercache_os_pages</structname> and <structname>pg_buffercache_numa</structname> + views wrap the function for convenient use (with its argument set to <literal>false</literal> + and <literal>true</literal> respectively). </para> <para> @@ -242,6 +245,52 @@ </para> </sect2> + <sect2 id="pgbuffercache-pg-buffercache-os-pages"> + <title>The <structname>pg_buffercache_os_pages</structname> View</title> + + <para> + The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-os-pages-columns"/>. + </para> + + <table id="pgbuffercache-os-pages-columns"> + <title><structname>pg_buffercache_os_pages</structname> Columns</title> + <tgroup cols="1"> + <thead> + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + Column Type + </para> + <para> + Description + </para></entry> + </row> + </thead> + + <tbody> + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>bufferid</structfield> <type>integer</type> + </para> + <para> + ID, in the range 1..<varname>shared_buffers</varname> + </para></entry> + </row> + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>os_page_num</structfield> <type>bigint</type> + </para> + <para> + Number of OS memory page for this buffer + </para></entry> + </row> + + </tbody> + </tgroup> + </table> + + </sect2> + <sect2 id="pgbuffercache-pg-buffercache-numa"> <title>The <structname>pg_buffercache_numa</structname> View</title> @@ -558,6 +607,61 @@ regression=# SELECT n.nspname, c.relname, count(*) AS buffers public | spgist_text_tbl | 182 (10 rows) +regression=# SELECT pages_per_buffer, COUNT(*) as buffer_count + FROM ( + SELECT bufferid, COUNT(*) as pages_per_buffer + FROM pg_buffercache_os_pages + GROUP BY bufferid + ) + GROUP BY pages_per_buffer + ORDER BY pages_per_buffer; + + pages_per_buffer | buffer_count +------------------+-------------- + 1 | 261120 + 2 | 1024 +(2 rows) + +regression=# SELECT n.nspname, c.relname, count(*) AS buffers_on_multiple_pages + FROM pg_buffercache b JOIN pg_class c + ON b.relfilenode = pg_relation_filenode(c.oid) AND + b.reldatabase IN (0, (SELECT oid FROM pg_database + WHERE datname = current_database())) + JOIN pg_namespace n ON n.oid = c.relnamespace + JOIN (SELECT bufferid FROM pg_buffercache_os_pages + GROUP BY bufferid HAVING count(*) > 1) m on m.bufferid = b.bufferid + GROUP BY n.nspname, c.relname + ORDER BY 3 DESC + LIMIT 10; + + nspname | relname | buffers_on_multiple_pages +------------+------------------------------+--------------------------- + public | delete_test_table | 3 + public | gin_test_idx | 2 + pg_catalog | pg_depend | 2 + public | quad_poly_tbl | 2 + pg_catalog | pg_depend_reference_index | 1 + pg_catalog | pg_index_indexrelid_index | 1 + pg_catalog | pg_constraint_contypid_index | 1 + pg_catalog | pg_statistic | 1 + pg_catalog | pg_depend_depender_index | 1 + pg_catalog | pg_operator | 1 +(10 rows) + + + nspname | relname | buffers_on_multiple_pages +------------+---------------------------------+--------------------------- + public | gin_test_tbl | 4 + public | delete_test_table | 4 + public | tenk1 | 4 + pg_catalog | pg_attribute_relid_attnum_index | 4 + pg_catalog | pg_class | 2 + pg_catalog | pg_depend_depender_index | 2 + pg_catalog | pg_attribute | 2 + pg_catalog | pg_opfamily | 2 + pg_catalog | pg_opclass_oid_index | 2 + pg_catalog | pg_description | 2 +(10 rows) regression=# SELECT * FROM pg_buffercache_summary(); buffers_used | buffers_unused | buffers_dirty | buffers_pinned | usagecount_avg diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index c751c25a04d..f6a398881f6 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -340,6 +340,8 @@ BufferAccessStrategy BufferAccessStrategyType BufferCacheNumaContext BufferCacheNumaRec +BufferCacheOsPagesContext +BufferCacheOsPagesRec BufferCachePagesContext BufferCachePagesRec BufferDesc -- 2.34.1
