On 6/27/25 19:33, Bertrand Drouvot wrote:
> Hi,
>
> On Fri, Jun 27, 2025 at 04:52:08PM +0200, Tomas Vondra wrote:
>> Here's three small patches, that should handle the issue
>
> Thanks for the patches!
>
>> 0001 - Adds the batching into pg_numa_query_pages, so that the callers
>> don't need to do anything.
>>
>> The batching doesn't seem to cause any performance regression. 32-bit
>> systems can't use that much memory anyway, and on 64-bit systems the
>> batch is sufficiently large (1024).
>
> === 1
>
> -#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
> +#define pg_numa_touch_mem_if_required(ptr) \
>
> Looks unrelated, should be in 0002?
>
Of course, I merged it into the wrong patch. Here's a v2 that fixes
this, and also rewords some of the comments and commit messages a
little bit. In particular, it now uses "chunking" instead of
"batching". I believe "batching" means combining multiple requests
into a single one, but we're doing exactly the opposite - splitting a
large request into smaller ones - which is what "chunking" means.

> === 2
>
> I thought that it would be better to provide a batch size only in the 32-bit
> case (see [1]), but I now think it makes sense to also provide (a larger) one
> for non 32-bit (as you did) due to the CFI added in 0003 (as it's also good to
> have it for non 32-bit).
>

Agreed, I think the CFI is a good thing to have.

>> 0002 - Silences the valgrind reports about the memory touching. It
>> replaces the macro with a static inline function, and adds
>> suppressions for both 32-bit and 64-bit. The 32-bit one may be a bit
>> pointless, because on my rpi5 valgrind produces a bunch of other
>> stuff anyway. But it doesn't hurt.
>>
>> The function now looks like this:
>>
>>     static inline void
>>     pg_numa_touch_mem_if_required(void *ptr)
>>     {
>>         volatile uint64 touch pg_attribute_unused();
>>
>>         touch = *(volatile uint64 *) ptr;
>>     }
>>
>> I did a lot of testing on multiple systems to check that replacing
>> the macro with a static inline function still works - and it seems
>> it does. But if someone thinks the function won't work, I'd like to
>> know.
>
> LGTM.
>
>> 0003 - While working on these patches, it occurred to me we
>> could/should add CHECK_FOR_INTERRUPTS() into the batch loop. This
>> querying can take quite a bit of time, so letting people interrupt
>> it seems reasonable. It wasn't possible with just one call into the
>> kernel, but with the batching we can add a CFI.
>
> Yeah, LGTM.
>

Thanks! I plan to push this tomorrow morning.


-- 
Tomas Vondra
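PS: For anyone who wants to see why chunks of at most 16 pointers
sidestep the kernel bug 0001 describes, here's a simplified sketch of
the do_pages_stat() loop. It's paraphrased to show the shape of the
bug, not the actual kernel source, and the helper names are made up:

    #define DO_PAGES_STAT_CHUNK_NR 16

    /* "pages" points into the user-space array passed to move_pages(2) */
    while (nr_pages)
    {
        chunk_nr = min(nr_pages, DO_PAGES_STAT_CHUNK_NR);

        /* copies chunk_nr entries, honoring the compat (32-bit) layout */
        copy_pages_from_user(chunk_pages, pages, chunk_nr);
        do_pages_stat_array(mm, chunk_nr, chunk_pages, status);

        /*
         * BUG: "pages" has a 64-bit pointer type, so this advances by
         * chunk_nr * 8 bytes. With a 32-bit user space the array
         * entries are only 4 bytes each, so pointers get skipped and
         * memory past the array ends up interpreted as pointers.
         */
        pages += chunk_nr;
        status += chunk_nr;
        nr_pages -= chunk_nr;
    }

With requests of at most 16 pointers the loop body runs exactly once
and the broken advance never executes, which is why the 32-bit builds
cap NUMA_QUERY_CHUNK_SIZE at 16.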
From d5dd0631c5c5233cabe2000f310a3f902230a284 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 12:43:20 +0200
Subject: [PATCH v2 1/3] Limit the size of numa_move_pages requests

There's a kernel bug in do_pages_stat(), affecting systems combining
64-bit kernel and 32-bit user space. The function splits the request
into chunks of 16 pointers, but forgets the pointers are 32-bit when
advancing to the next chunk. Some of the pointers get skipped, and
memory after the array is interpreted as pointers. The result is that
the produced status of memory pages is mostly bogus.

Systems combining 64-bit and 32-bit environments like this might seem
rare, but that's not the case - all 32-bit Debian packages are built
in a 32-bit chroot on a system with a 64-bit kernel.

This is a long-standing kernel bug (since 2010), affecting pretty much
all kernels, so it'll take time until all systems get a fixed kernel.
Luckily, we can work around the issue by chunking the requests the
same way do_pages_stat() does, at least on affected systems.

We don't know what kernel a 32-bit build will run on, so all 32-bit
builds use chunks of 16 elements (the largest chunk before hitting the
issue). 64-bit builds are not affected by this issue, and so could
work without the chunking. But chunking has other advantages, so we
apply chunking even for 64-bit builds, with chunks of 1024 elements.

Reported-by: Christoph Berg <m...@debian.org>
Author: Christoph Berg <m...@debian.org>
Author: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 src/port/pg_numa.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index 4b487a2a4e8..d5935207d0a 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -29,6 +29,19 @@
 #include <numa.h>
 #include <numaif.h>
 
+/*
+ * numa_move_pages() chunk size, has to be <= 16 to work around a kernel bug
+ * in do_pages_stat() (chunked by DO_PAGES_STAT_CHUNK_NR). By using the same
+ * chunk size, we make it work even on unfixed kernels.
+ *
+ * 64-bit systems are not affected by the bug, and so use much larger chunks.
+ */
+#if SIZEOF_SIZE_T == 4
+#define NUMA_QUERY_CHUNK_SIZE 16
+#else
+#define NUMA_QUERY_CHUNK_SIZE 1024
+#endif
+
 /* libnuma requires initialization as per numa(3) on Linux */
 int
 pg_numa_init(void)
@@ -42,11 +55,46 @@ pg_numa_init(void)
  * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
  * first one allows us to batch and query about many memory pages in one single
  * giant system call that is way faster.
+ *
+ * We call numa_move_pages() for smaller chunks of the whole array. The first
+ * reason is to work around a kernel bug, but also to allow interrupting the
+ * query between the calls (with many pointers, processing the whole array
+ * can take a lot of time).
  */
 int
 pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
 {
-	return numa_move_pages(pid, count, pages, NULL, status, 0);
+	unsigned long next = 0;
+	int			ret = 0;
+
+	/*
+	 * Chunk pointers passed to numa_move_pages to NUMA_QUERY_CHUNK_SIZE
+	 * items, to work around a kernel bug in do_pages_stat().
+	 */
+	while (next < count)
+	{
+		unsigned long count_chunk = Min(count - next,
+										NUMA_QUERY_CHUNK_SIZE);
+
+		/*
+		 * Bail out if any of the chunks errors out (ret<0). We ignore
+		 * (ret>0) which is used to return the number of non-migrated
+		 * pages, but we're not migrating any pages here.
+		 */
+		ret = numa_move_pages(pid, count_chunk, &pages[next], NULL, &status[next], 0);
+		if (ret < 0)
+		{
+			/* plain error, return as is */
+			return ret;
+		}
+
+		next += count_chunk;
+	}
+
+	/* should have consumed the input array exactly */
+	Assert(next == count);
+
+	return 0;
 }
 
 int
-- 
2.49.0
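Note that the chunking is invisible to callers - the function keeps
the exact same signature and contract. A minimal caller sketch,
loosely modeled on pg_get_shmem_allocations_numa(); the variable names
here are made up for the example:

    /* build the array of per-OS-page pointers to query */
    void      **pages = palloc(sizeof(void *) * os_page_count);
    int        *status = palloc(sizeof(int) * os_page_count);

    for (unsigned long i = 0; i < os_page_count; i++)
        pages[i] = startptr + (i * os_page_size);

    /* pid 0 means "the calling process", as with move_pages(2) */
    if (pg_numa_query_pages(0, os_page_count, pages, status) < 0)
        elog(ERROR, "failed NUMA pages inquiry: %m");

    /* status[i] is now the NUMA node of page i, or a negative errno */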
From 9cb095a2061deea7f0a781177f4c89928392d7ce Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 12:47:38 +0200
Subject: [PATCH v2 2/3] Silence valgrind about pg_numa_touch_mem_if_required

When querying NUMA status of pages in shared memory, we need to touch
the memory first to get valid results. This may trigger valgrind
reports, because some of the memory (e.g. unpinned buffers) may be
marked as noaccess. Solved by adding a valgrind suppression.

An alternative would be to adjust the access/noaccess status before
touching the memory, but that seems far too invasive. It would require
all those places to have detailed knowledge of what the shared memory
stores.

The pg_numa_touch_mem_if_required() macro is replaced with a function.
Macros are invisible to suppressions, so we'd have to suppress reports
for the caller - e.g. pg_get_shmem_allocations_numa(). So we'd
suppress reports for the whole function, and that seems too
heavy-handed. It might easily hide other valid issues.

Reviewed-by: Christoph Berg <m...@debian.org>
Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 contrib/pg_buffercache/pg_buffercache_pages.c |  3 +--
 src/backend/storage/ipc/shmem.c               |  4 +---
 src/include/port/pg_numa.h                    | 10 +++++++---
 src/tools/valgrind.supp                       | 14 ++++++++++++++
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 4b007f6e1b0..ae0291e6e96 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -320,7 +320,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 	uint64		os_page_count;
 	int			pages_per_buffer;
 	int			max_entries;
-	volatile uint64 touch pg_attribute_unused();
 	char	   *startptr,
 			   *endptr;
 
@@ -375,7 +374,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 
 			/* Only need to touch memory once per backend process lifetime */
 			if (firstNumaTouch)
-				pg_numa_touch_mem_if_required(touch, ptr);
+				pg_numa_touch_mem_if_required(ptr);
 		}
 
 	Assert(idx == os_page_count);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c9ae3b45b76..ca3656fc76f 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -679,12 +679,10 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
 		 */
 		for (i = 0; i < shm_ent_page_count; i++)
 		{
-			volatile uint64 touch pg_attribute_unused();
-
 			page_ptrs[i] = startptr + (i * os_page_size);
 
 			if (firstNumaTouch)
-				pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+				pg_numa_touch_mem_if_required(page_ptrs[i]);
 
 			CHECK_FOR_INTERRUPTS();
 		}
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 40f1d324dcf..6c8b7103cc3 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -24,12 +24,16 @@ extern PGDLLIMPORT int pg_numa_get_max_node(void);
  * This is required on Linux, before pg_numa_query_pages() as we
  * need to page-fault before move_pages(2) syscall returns valid results.
  */
-#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
-	ro_volatile_var = *(volatile uint64 *) ptr
+static inline void
+pg_numa_touch_mem_if_required(void *ptr)
+{
+	volatile uint64 touch pg_attribute_unused();
+
+	touch = *(volatile uint64 *) ptr;
+}
 
 #else
 
-#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+#define pg_numa_touch_mem_if_required(ptr) \
 	do {} while(0)
 
 #endif
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index 7ea464c8094..2ad5b81526d 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -180,3 +180,17 @@
 	Memcheck:Cond
 	fun:PyObject_Realloc
 }
+
+# NUMA introspection requires touching memory first, and some of it may
+# be marked as noaccess (e.g. unpinned buffers). So just ignore that.
+{
+	pg_numa_touch_mem_if_required
+	Memcheck:Addr4
+	fun:pg_numa_touch_mem_if_required
+}
+
+{
+	pg_numa_touch_mem_if_required
+	Memcheck:Addr8
+	fun:pg_numa_touch_mem_if_required
+}
-- 
2.49.0
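For anyone wondering why the touch is needed in the first place:
move_pages(2) in query mode (nodes = NULL) reports -ENOENT for pages
not yet mapped into the process, so the status only becomes meaningful
after a page fault. A small standalone demo of that behavior (not
PostgreSQL code; assumes Linux with libnuma, build with
"cc demo.c -lnuma", and that the allocator hasn't already faulted the
page in):

    #include <stdio.h>
    #include <stdlib.h>
    #include <numaif.h>

    int
    main(void)
    {
        size_t  pagesize = 4096;    /* assume 4kB pages for simplicity */
        char   *buf = aligned_alloc(pagesize, pagesize);
        void   *pages[1] = {buf};
        int     status = 0;

        /* not faulted in yet - expect status == -ENOENT (-2) */
        move_pages(0, 1, pages, NULL, &status, 0);
        printf("before touch: %d\n", status);

        *(volatile char *) buf = 0;     /* the "touch" */

        /* now we should get an actual NUMA node number (>= 0) */
        move_pages(0, 1, pages, NULL, &status, 0);
        printf("after touch: %d\n", status);
        return 0;
    }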
From 690904468235f7093214e1323714d14b8a22a6ca Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Fri, 27 Jun 2025 16:42:43 +0200
Subject: [PATCH v2 3/3] Add CHECK_FOR_INTERRUPTS into pg_numa_query_pages

Querying the NUMA status can be quite time consuming, especially with
large shared buffers. 8cc139bec34a simply called numa_move_pages(),
which meant we had to wait for the syscall to complete.

But with the chunking, introduced to work around the do_pages_stat()
bug, we can do CHECK_FOR_INTERRUPTS() after each chunk, allowing users
to abort the execution.

Reviewed-by: Christoph Berg <m...@debian.org>
Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Discussion: https://postgr.es/m/aetdozlmtzdda...@msg.df7cb.de
---
 src/port/pg_numa.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
index d5935207d0a..c65f22020ea 100644
--- a/src/port/pg_numa.c
+++ b/src/port/pg_numa.c
@@ -16,6 +16,7 @@
 #include "c.h"
 
 #include <unistd.h>
+#include "miscadmin.h"
 #include "port/pg_numa.h"
 
 /*
@@ -76,6 +77,8 @@ pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
 		unsigned long count_chunk = Min(count - next,
 										NUMA_QUERY_CHUNK_SIZE);
 
+		CHECK_FOR_INTERRUPTS();
+
 		/*
 		 * Bail out if any of the chunks errors out (ret<0). We ignore
 		 * (ret>0) which is used to return the number of non-migrated
-- 
2.49.0
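FWIW, a back-of-the-envelope calculation of what this buys (rough
numbers, not measurements): with NUMA_QUERY_CHUNK_SIZE = 1024 and 4kB
pages, each chunk covers 4MB of memory, so querying e.g. 128GB of
shared buffers (about 33.5M pages) makes 32768 numa_move_pages() calls
and hits CHECK_FOR_INTERRUPTS() between every pair of them. Before
this patch the same query was a single uninterruptible syscall over
the whole array.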