On 28.10.25 14:36, Nico Pache wrote:
On Mon, Oct 27, 2025 at 11:54 AM Lorenzo Stoakes
<[email protected]> wrote:

On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
The current mechanism for determining mTHP collapse scales the
khugepaged_max_ptes_none value based on the target order. This
introduces an undesirable feedback loop, or "creep", when max_ptes_none
is set to a value greater than HPAGE_PMD_NR / 2.

With this configuration, a successful collapse to order N will populate
enough pages to satisfy the collapse condition on order N+1 on the next
scan. This leads to unnecessary work and memory churn.

To fix this issue introduce a helper function that caps the max_ptes_none
to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
the max_ptes_none number by the (PMD_ORDER - target collapse order).

The limits can be ignored by passing full_scan=true. This is useful for
madvise_collapse (which ignores limits) and, in the case of
collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
collapse is available.
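
To make the creep described above concrete, here is a minimal userspace
sketch, not the kernel code. It assumes 4k pages (PMD order 9), that
"scaling" means a right shift by (PMD order - target order), that each scan
tries orders from largest to smallest, and that populated PTEs sit at the
start of the range. The names scaled_max_none() and simulate() are
hypothetical and exist only for this illustration.

```c
#define PMD_ORDER 9                    /* assumption: 4k pages, 2M PMD */
#define PMD_NR    (1u << PMD_ORDER)    /* 512 PTEs per PMD */
#define MIN_ORDER 2                    /* assumption: smallest mTHP order */

/*
 * Hypothetical model of the scaled threshold. With capped != 0, mTHP
 * orders are limited to PMD_NR / 2 - 1 before scaling; the PMD order
 * itself always uses the configured value unchanged.
 */
static unsigned int scaled_max_none(unsigned int max_ptes_none,
				    unsigned int order, int capped)
{
	if (capped && order != PMD_ORDER && max_ptes_none > PMD_NR / 2 - 1)
		max_ptes_none = PMD_NR / 2 - 1;
	return max_ptes_none >> (PMD_ORDER - order);
}

/*
 * Repeat scans until nothing collapses. Each scan collapses to the
 * largest eligible order. Returns the final number of populated PTEs
 * and stores the number of collapses performed in *collapses.
 */
static unsigned int simulate(unsigned int max_ptes_none,
			     unsigned int populated, int capped,
			     unsigned int *collapses)
{
	*collapses = 0;
	for (;;) {
		int order, progressed = 0;

		for (order = PMD_ORDER; order >= MIN_ORDER; order--) {
			unsigned int nr = 1u << order;

			if (populated >= nr)
				break;	/* already at least this large */
			if (nr - populated <=
			    scaled_max_none(max_ptes_none, order, capped)) {
				populated = nr;	/* collapse fills the range */
				(*collapses)++;
				progressed = 1;
				break;
			}
		}
		if (!progressed)
			return populated;
	}
}
```

Under these assumptions, simulate(256, 2, 0, &n) starts from two populated
PTEs and ends at 512 after eight collapses, one per order from 2 to 9: each
collapse leaves exactly half of the next order populated, which meets the
scaled threshold on the following scan. With the cap, simulate(256, 3, 1, &n)
stops at 4 after a single collapse, because the order-3 threshold is
255 >> 6 = 3 while 4 PTEs are still empty.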

Signed-off-by: Nico Pache <[email protected]>
---
  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
  1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4ccebf5dda97..286c3a7afdee 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
               wake_up_interruptible(&khugepaged_wait);
  }

+/**
+ * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
+ * @order: The folio order being collapsed to
+ * @full_scan: Whether this is a full scan (ignore limits)
+ *
+ * For madvise-triggered collapses (full_scan=true), all limits are bypassed
+ * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
+ *
+ * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
+ * khugepaged_max_ptes_none value.
+ *
+ * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
+ * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
+ *
+ * Return: Maximum number of empty PTEs allowed for the collapse operation
+ */
+static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
+{
+     unsigned int max_ptes_none;
+
+     /* ignore max_ptes_none limits */
+     if (full_scan)
+             return HPAGE_PMD_NR - 1;
+
+     if (order == HPAGE_PMD_ORDER)
+             return khugepaged_max_ptes_none;
+
+     max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);


Hey Lorenzo,

I mean not to beat a dead horse re: v11 commentary, but I thought we were going
to implement David's idea re: the new 'eagerness' tunable, and again we're now
just implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?

I spoke to David and he said to continue forward with this series; the
"eagerness" tunable will take some time, and may require further
considerations/discussion.

Right, after talking to Johannes it got clearer that what we envisioned with "eagerness" would not be like swappiness, and we will really have to be careful here. I don't know yet when I will have time to look into that.

If we want to avoid the implicit capping, I think there are the following possible approaches:

(1) Tolerate creep for now, maybe warning if the user configures it.
(2) Avoid creep by counting zero-filled pages towards none_or_zero.
(3) Have separate toggles for each THP size. Doesn't quite solve the
    problem, only shifts it.

Anything else?

IIUC, creep is less of a problem when we have the underused shrinker enabled: whatever we over-allocated can (unless longterm-pinned etc) get reclaimed again.

So maybe having underused-shrinker support for mTHP as well would be a solution to tackle (1) later?

--
Cheers

David / dhildenb

