On 6/24/25 17:30, Christoph Berg wrote:
> Re: Tomas Vondra
>> If it's a reliable fix, then I guess we can do it like this. But won't
>> that be a performance penalty on everyone? Or does the system split the
>> array into 16-element chunks anyway, so this makes no difference?
>
> There's still the overhead of the syscall itself. But no idea how
> costly it is to have this 16-step loop in user or kernel space.
>
> We could claim that on 32-bit systems, shared_buffers would be smaller
> anyway, so there the overhead isn't that big. And the step size should
> be larger (if at all) on 64-bit.
>
>> Anyway, maybe we should start by reporting this to the kernel people. Do
>> you want me to do that, or shall one of you take care of that? I suppose
>> that'd be better, as you already wrote a fix / know the code better.
>
> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>
Thanks! Now we wait ...
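
FWIW the userspace workaround discussed above would look roughly like
this - just a sketch with a hypothetical helper name and constant,
assuming the existing pg_numa_query_pages() wrapper; the 16 matches the
chunk size the kernel uses internally (DO_PAGES_STAT_CHUNK_NR in
mm/migrate.c):

/*
 * Sketch of the chunked workaround: query the NUMA status in batches
 * of 16 pages instead of passing the whole array to move_pages(2) at
 * once. Helper name and constant are made up for illustration.
 */
#define QUERY_CHUNK_SIZE	16

static int
pg_numa_query_pages_chunked(void **pages, int *status, unsigned long count)
{
	unsigned long off;

	for (off = 0; off < count; off += QUERY_CHUNK_SIZE)
	{
		unsigned long chunk = Min(QUERY_CHUNK_SIZE, count - off);

		/* pg_numa_query_pages() wraps the move_pages(2) syscall */
		if (pg_numa_query_pages(0, chunk, &pages[off], &status[off]) < 0)
			return -1;
	}

	return 0;
}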
Attached is a minor tweak of the valgrind suppression rules, adding
rules for the two places touching the memory. I was hoping I could add a
single rule for pg_numa_touch_mem_if_required, but that does not work -
it's a macro, not a function. So I had to add one rule for each of the
two functions querying the NUMA status. That's a bit disappointing,
because it means it'll hide all other failures (of the Memcheck:Addr8
type) in those functions.
Perhaps it'd be better to turn pg_numa_touch_mem_if_required into a
proper (inlined) function, at least with USE_VALGRIND defined. Something
like the v2 patch - that needs more testing to ensure the inlined
function doesn't break the touching or something silly like that.
regards
--
Tomas Vondra
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index 7ea464c8094..36bf3253f76 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -180,3 +180,22 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Querying the NUMA status of shared memory requires touching the memory
+# first, so that it gets mapped into the process. Some of that memory
+# backs shared buffers, which may be marked as noaccess while the
+# buffers are not pinned. Ignore that - we're not really accessing the
+# buffers - in both places that query the NUMA status.
+{
+	pg_buffercache_numa_pages
+	Memcheck:Addr8
+	fun:pg_buffercache_numa_pages
+	fun:ExecMakeTableFunctionResult
+}
+
+{
+	pg_get_shmem_allocations_numa
+	Memcheck:Addr8
+	fun:pg_get_shmem_allocations_numa
+	fun:ExecMakeTableFunctionResult
+}
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 40f1d324dcf..3b9a5b42898 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -24,9 +24,24 @@ extern PGDLLIMPORT int pg_numa_get_max_node(void);
* This is required on Linux, before pg_numa_query_pages() as we
* need to page-fault before move_pages(2) syscall returns valid results.
*/
+#ifdef USE_VALGRIND
+
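+/* same signature as the macro variant below; the first argument is unused */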
+static inline void
+pg_numa_touch_mem_if_required(uint64 tmp, char *ptr)
+{
+	volatile uint64 ro_volatile_var pg_attribute_unused();
+
+	ro_volatile_var = *(volatile uint64 *) ptr;
+}
+
+#else
+
#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
ro_volatile_var = *(volatile uint64 *) ptr
+#endif
+
#else
#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
ro_volatile_var = *(volatile uint64 *) ptr
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index 7ea464c8094..6b9a8998f82 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -180,3 +180,14 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Querying the NUMA status of shared memory requires touching the memory
+# first, so that it gets mapped into the process. Some of that memory
+# backs shared buffers, which may be marked as noaccess while the
+# buffers are not pinned. Ignore that - we're not really accessing the
+# buffers - in all places that query the NUMA status.
+{
+	pg_numa_touch_mem_if_required
+	Memcheck:Addr8
+	fun:pg_numa_touch_mem_if_required
+}
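
For context, the call sites look about like this (based on
pg_buffercache_numa.c; ptr stands for whatever page pointer the loop
computes), which is why the inline function keeps the same two-argument
signature as the macro:

	uint64		touch pg_attribute_unused();

	/* ptr points at the shared-memory page being queried */
	pg_numa_touch_mem_if_required(touch, ptr);

So the macro-to-function switch should not require any changes at the
call sites, only in pg_numa.h.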