== Problem ==
We observed an issue in production on a multi-node NUMA system where
kswapd runs endlessly, causing sustained heavy read I/O pressure across
the entire system.
The root cause is that direct reclaim triggered by cgroup memory.high
keeps resetting kswapd_failures to 0, even when the node cannot be
balanced. This prevents kswapd from ever stopping after reaching
MAX_RECLAIM_RETRIES.
```bash
bpftrace -e '
#include <linux/mmzone.h>
#include <linux/shrinker.h>

// On entry to balance_pgdat(), print the failure count left by earlier runs.
kprobe:balance_pgdat {
    $pgdat = (struct pglist_data *)arg0;
    if ($pgdat->kswapd_failures > 0) {
        printf("[node %d] [%lu] kswapd end, kswapd_failures %d\n",
               $pgdat->node_id, jiffies, $pgdat->kswapd_failures);
    }
}

// Direct reclaim finished; the value printed is nr_reclaimed for that pass.
tracepoint:vmscan:mm_vmscan_direct_reclaim_end {
    printf("[cpu %d] [%lu] reset kswapd_failures %d \n", cpu, jiffies,
           args.nr_reclaimed);
}
'
```
The trace showed that whenever kswapd_failures reached 15, continuous
direct reclaim reset it to 0. This was accompanied by a flood of
kswapd_failures log entries, and shortly afterwards we observed massive
refaults.
== Solution ==
Patch 1 fixes the issue by only resetting kswapd_failures when the node
is actually balanced. This introduces pgdat_try_reset_kswapd_failures()
as a wrapper that checks pgdat_balanced() before resetting.
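The check-before-reset behaviour of patch 1 can be modelled in userspace
roughly as follows. This is a minimal sketch, not the patch itself:
struct node_model, its free_pages and high_wmark fields, and
model_balanced() are hypothetical stand-ins for struct pglist_data and
pgdat_balanced(), and model_try_reset_kswapd_failures() only mimics what
pgdat_try_reset_kswapd_failures() is described as doing above.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for struct pglist_data. Only kswapd_failures
 * mirrors a real field name; the other fields are assumptions.
 */
struct node_model {
	unsigned long free_pages;  /* assumed: free pages on the node */
	unsigned long high_wmark;  /* assumed: level that counts as balanced */
	int kswapd_failures;
};

/* Hypothetical stand-in for pgdat_balanced(). */
bool model_balanced(const struct node_model *n)
{
	return n->free_pages >= n->high_wmark;
}

/*
 * Reset the failure counter only when the node is balanced.
 * Returns true if a reset happened, false if the counter was kept.
 */
bool model_try_reset_kswapd_failures(struct node_model *n)
{
	if (!model_balanced(n))
		return false;
	n->kswapd_failures = 0;
	return true;
}
```

In this model a caller on the direct reclaim path that makes progress
on an unbalanced node leaves the counter untouched, so kswapd can still
reach its retry limit and stop.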
Patch 2 extends the wrapper to track why kswapd_failures was reset,
adding tracepoints for better observability:
- mm_vmscan_reset_kswapd_failures: traces each reset with reason
- mm_vmscan_kswapd_reclaim_fail: traces each kswapd reclaim failure
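The idea of tagging each reset with a reason can be illustrated with a
small userspace model. Everything here is an assumption made for the
sake of the example: the reason codes, struct traced_node_model, and the
record format are hypothetical and do not reflect the fields of the real
tracepoints.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reason codes, named for illustration only. */
enum reset_reason_model {
	RESET_BY_KSWAPD_MODEL,
	RESET_BY_DIRECT_RECLAIM_MODEL,
};

/* Hypothetical stand-in for the per-node state. */
struct traced_node_model {
	bool balanced;        /* assumed result of the balance check */
	int kswapd_failures;
};

const char *reset_reason_name(enum reset_reason_model reason)
{
	switch (reason) {
	case RESET_BY_KSWAPD_MODEL:
		return "kswapd";
	case RESET_BY_DIRECT_RECLAIM_MODEL:
		return "direct";
	}
	return "unknown";
}

/*
 * Reset the counter only when the node is balanced, and write one
 * trace-like record per actual reset into buf. Returns true on reset;
 * buf is left untouched when no reset happens.
 */
bool traced_try_reset(struct traced_node_model *n,
		      enum reset_reason_model reason,
		      char *buf, size_t len)
{
	if (!n->balanced)
		return false;
	snprintf(buf, len, "reset kswapd_failures=%d reason=%s",
		 n->kswapd_failures, reset_reason_name(reason));
	n->kswapd_failures = 0;
	return true;
}
```

Recording the old counter value together with the reason is what lets a
trace distinguish a legitimate reset from one that would hide a node
kswapd cannot balance.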
---
v2 -> v3:
https://lore.kernel.org/all/[email protected]/
- Add tracepoints for kswapd_failures reset and reclaim failure
- Expand commit message with test results
v1 -> v2:
https://lore.kernel.org/all/[email protected]/
Jiayuan Chen (2):
  mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  mm/vmscan: add tracepoint and reason for kswapd_failures reset

 include/linux/mmzone.h        |  9 +++++++
 include/trace/events/vmscan.h | 51 +++++++++++++++++++++++++++++++++++
 mm/memory-tiers.c             |  2 +-
 mm/page_alloc.c               |  2 +-
 mm/vmscan.c                   | 33 ++++++++++++++++++++---
 5 files changed, 91 insertions(+), 6 deletions(-)
--
2.43.0