On Wed, Jan 14, 2026 at 03:40:35PM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <[email protected]>
>
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
>
> We observed an issue in production on a multi-NUMA system where a
> process allocated large amounts of anonymous pages on a single NUMA
> node, pushing free memory on that node below the high watermark and
> evicting most file pages:
>
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
>
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
>
> However, containers on this machine have memory.high set in their
> cgroup. Business processes continuously trigger the high limit, causing
> frequent direct reclaim that keeps resetting kswapd_failures to 0. This
> prevents kswapd from ever stopping.
>
> The key insight is that direct reclaim triggered by cgroup memory.high
> performs aggressive scanning to throttle the allocating process. With
> sufficiently aggressive scanning, even hot pages will eventually be
> reclaimed, making direct reclaim "successful" at freeing some memory.
> However, this success does not mean the node has reached a balanced
> state - the freed memory may still be insufficient to bring free pages
> above the high watermark. Unconditionally resetting kswapd_failures in
> this case keeps kswapd alive indefinitely.
>
> The result is that kswapd runs endlessly. Unlike direct reclaim, which
> only reclaims from the allocating cgroup, kswapd scans the entire
> node's memory. This causes hot file pages from all workloads on the
> node to be evicted, not just those from the cgroup triggering
> memory.high. These pages constantly refault, generating sustained
> heavy IO read pressure across the entire system.
>
> Fix this by only resetting kswapd_failures when the node is actually
> balanced. This allows both kswapd and direct reclaim to clear
> kswapd_failures upon successful reclaim, but only when the reclaim
> actually resolves the memory pressure (i.e., the node becomes
> balanced).
>
> Signed-off-by: Jiayuan Chen <[email protected]>
> Signed-off-by: Jiayuan Chen <[email protected]>
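
For anyone reading along, the idea maps onto mm/vmscan.c roughly as the
sketch below, assuming the reset still sits at the end of shrink_node()
as in the current tree. This is an illustration only, not the actual
patch: the final condition and placement may differ, and since
pgdat_balanced() is currently defined further down in the file, a real
patch would also need a forward declaration or some code movement.

In shrink_node():

	-	if (reclaimable)
	-		pgdat->kswapd_failures = 0;
	+	/*
	+	 * Only clear the failure counter when reclaim progress
	+	 * also leaves the node balanced. A "successful" but
	+	 * insufficient direct reclaim (e.g. one driven by
	+	 * memory.high throttling) then no longer revives a
	+	 * kswapd that cannot balance the node anyway.
	+	 */
	+	if (reclaimable &&
	+	    pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
	+		pgdat->kswapd_failures = 0;
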
After incorporating the suggestions from Johannes, you can add:

Acked-by: Shakeel Butt <[email protected]>
