On 2/13/26 12:54 AM, Vlastimil Babka wrote:
> On 2/12/26 22:25, JP Kobryn wrote:
>> On 2/12/26 7:24 AM, Vlastimil Babka wrote:
>>> On 2/12/26 05:51, JP Kobryn wrote:
>>>> It would be useful to see a breakdown of allocations to understand which
>>>> NUMA policies are driving them. For example, when investigating memory
>>>> pressure, having policy-specific counts could show that allocations were
>>>> bound to the affected node (via MPOL_BIND).
>>>>
>>>> Add per-policy page allocation counters as new node stat items. These
>>>> counters can help correlate a mempolicy with pressure on a given node.
>>>>
>>>> Signed-off-by: JP Kobryn <[email protected]>
>>>> Suggested-by: Johannes Weiner <[email protected]>
>>> Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended
>>> in a way that would capture any missing important details? A counter per
>>> policy type seems exhaustive, but on the one hand it might not be important
>>> to distinguish between some of them, and on the other hand it doesn't track
>>> the nodemask anyway.
>> The two patches of the series should complement each other. When
>> investigating memory pressure, we could identify the affected nodes
>> (patch 2), then cross-reference the policy-specific stats to find any
>> correlation (this patch).
>>
>> I think extending the numa_* counters would call for more permutations,
>> since each numa stat would need a variant per policy.
>> Distinguishing between MPOL_DEFAULT and MPOL_BIND is meaningful, for
>> example. Am I missing something?
> Are there other useful examples or would it be enough to add e.g. a
> numa_bind counter to the numa_hit/miss/etc.?
Aside from bind, it's worth emphasizing that with default policy tracking
we could see whether the local node is the source of pressure. In the
interleave case, we would be able to see whether the load is being
balanced across nodes or, in the weighted case, distributed according to
the weights.

As for extending the numa stats instead, I looked into this some more and
I'm not sure they're a good fit. They seem to be more about whether the
allocator succeeded at placement rather than which policy drove the
allocation. Thoughts?
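
To make the per-policy counters more concrete, here is a rough sketch of
the accounting (not the actual patch): it assumes the counters are added
to enum node_stat_item, and the NR_ALLOC_MPOL_* item names as well as
both helpers are made up for illustration.

/* Illustrative only: hypothetical per-policy node stat items. */
#include <linux/mmzone.h>
#include <linux/mempolicy.h>
#include <linux/vmstat.h>

static inline enum node_stat_item mpol_to_stat_item(unsigned short mode)
{
	switch (mode) {
	case MPOL_BIND:
		return NR_ALLOC_MPOL_BIND;		/* hypothetical */
	case MPOL_INTERLEAVE:
		return NR_ALLOC_MPOL_INTERLEAVE;	/* hypothetical */
	case MPOL_WEIGHTED_INTERLEAVE:
		return NR_ALLOC_MPOL_WEIGHTED_INTERLEAVE; /* hypothetical */
	case MPOL_PREFERRED:
		return NR_ALLOC_MPOL_PREFERRED;		/* hypothetical */
	default:
		return NR_ALLOC_MPOL_DEFAULT;		/* hypothetical */
	}
}

/* Bump the per-node counter once a page lands on node @nid under @pol. */
static inline void count_mpol_alloc(int nid, struct mempolicy *pol)
{
	mod_node_page_state(NODE_DATA(nid), mpol_to_stat_item(pol->mode), 1);
}

The exact set of items is open; the point is just that each allocation
gets attributed to the policy mode that drove it, on the node it landed
on.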
> What I'm trying to say is that the level of detail you are trying to add
> to the always-on counters seems more suitable for tracepoints. The
> counters should be limited to what's known to be useful, not "everything
> we are able to track and could possibly need one day".
In a triage scenario, having the stats already collected up to the time
of the reported issue is what matters. We make use of a tool called
below[0], which periodically samples the system and lets us view the
historical state leading up to the issue. If we only started attaching
tracepoints at the time of the incident, it would be too late.
The triage workflow would look like this:
1) Pressure/OOMs reported while system-wide memory is free.
2) Check per-node pgscan/pgsteal stats (provided by patch 2) to narrow
down node(s) under pressure.
3) Check the per-policy allocation counters (this patch) on that node to
find which policy was driving it (see the sketch below).
[0] https://github.com/facebookincubator/below
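
For reference, here is a rough sketch of what steps 2 and 3 could look
like from userspace. It assumes that the per-node pgscan/pgsteal stats
from patch 2 and the per-policy counters from this patch are exported in
the per-node vmstat files like other node stats; the "nr_alloc_mpol_"
prefix is hypothetical.

/* Triage sketch: dump the relevant per-node counters from sysfs. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char path[64], line[128];

	/* assumes node ids are contiguous starting at 0 */
	for (int nid = 0; ; nid++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/vmstat", nid);
		FILE *f = fopen(path, "r");
		if (!f)
			break;	/* no more nodes */
		printf("node %d:\n", nid);
		while (fgets(line, sizeof(line), f)) {
			/* reclaim pressure (patch 2), policy counters (this patch) */
			if (!strncmp(line, "pgscan_", 7) ||
			    !strncmp(line, "pgsteal_", 8) ||
			    !strncmp(line, "nr_alloc_mpol_", 14))
				printf("  %s", line);
		}
		fclose(f);
	}
	return 0;
}

In practice this history would come from a sampler like below rather
than ad hoc polling, but this is the shape of the correlation: per-node
reclaim activity next to per-node, per-policy allocation counts.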