On Fri, Apr 24, 2026 at 11:34:48AM +0100, Kiryl Shutsemau wrote:
> On Thu, Apr 23, 2026 at 02:57:34PM -0400, Peter Xu wrote:
> > On Thu, Apr 23, 2026 at 07:08:00PM +0100, Kiryl Shutsemau wrote:
> > > > - Whether read protection is required for an userspace swap system
> > > >   (e.g. did you get time to have a look at umap?)
> > > 
> > > I looked at it briefly, so I can miss details.
> > > 
> > > IIUC, in absence of read tracking it doesn't collect hotness information
> > > at all. The eviction is based on fault-in time: the oldest faulted-in
> > 
> > For example, let's imagine if we can have a per-mm idle page tracker, would
> > it work for you to collect hotness info?
> >
> > The other idea is, no matter whether we use MGLRU or legacy LRU, if we can
> > expose a better interface to share hotness info from kernel to userspace,
> > would it be possible?
> 
> I don't see how either fits our problem.
> 
> Both page_idle and the LRUs (legacy or MGLRU) track accesses on physical
> memory. We need visibility in the virtual address space domain.

Yes they are, but the ACCESS bit isn't.  The ACCESS bit is only about a
virtual mapping or a similar mapping (like EPT's access bit).

What I described with per-mm tracking (whether we call it per-mm idle page
tracking or use some other interface) is about relying on the ACCESS bit,
not on pgtable changes using RWP.  IMHO it's more efficient and it will
also achieve your goal of VA tracking.

In your case (and also ours), if you're looking for VMs running virtual
machines, I think you need both pgtable's ACCESS bit and EPT-similar ACCESS
bit.  Here what's redundant is rmap, not ACCESS bit tracking.  When both
the MMU and the secondary MMU support hardware access tracking, AFAIU it's
faster than RWP.

> 
> We don't care which physical page backs a given guest address at any
> moment. We want to know which piece of the user's dataset is cold, and
> the answer has to be indifferent to kernel actions underneath: the
> tracking must survive migration and swap-out. RWP gives us that — the

This is exactly what we hit...  that's the reason why I was trying to
propose a new API to read directly from swap (swap_access) or similar.

Btw, from another perspective, I believe we could also persist the ACCESS
bit across migration or swap out.

For migration, see e.g. what remove_migration_pte() has:

                if (!softleaf_is_migration_young(entry))
                        pte = pte_mkold(pte);

For swap, it's different.  Normally, if a userspace app manages page
hotness, it will record the hotness in userspace with whatever algorithm
it wants.  That record then also survives host swap, because the hotness
is per-VA.  It should be deduced from whatever hotness tracking system the
app previously used to sample (and that can still be idle page tracking,
even if not efficient enough; when the VM page isn't mapped anywhere else,
rmap is pure overhead, but it doesn't introduce false positives).
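
To make the "hotness is per-VA" point concrete, here is a rough userspace
sketch.  Everything in it is hypothetical (struct va_hotness, the helper
names, the 4 KiB page size, the aging rule); it is not umap's code or any
kernel interface.  It only shows that a table keyed by virtual address
keeps its history no matter what the kernel does to the backing folio:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED 12   /* assumption: 4 KiB pages */
#define HOT_MAX 255u

/* Hypothetical userspace table: one 8-bit hotness score per virtual page. */
struct va_hotness {
	uintptr_t base;     /* start of the tracked VA range */
	size_t    npages;   /* number of pages tracked */
	uint8_t  *score;    /* caller-provided array of npages scores */
};

/* Map a virtual address inside the range to its slot in the table. */
static size_t va_to_index(const struct va_hotness *t, uintptr_t va)
{
	assert(va >= t->base);
	size_t idx = (va - t->base) >> PAGE_SHIFT_ASSUMED;
	assert(idx < t->npages);
	return idx;
}

/* Fold one sample in: halve the old score (aging), boost it if accessed. */
static void va_hotness_sample(struct va_hotness *t, uintptr_t va, int accessed)
{
	size_t i = va_to_index(t, va);
	unsigned s = t->score[i] >> 1;

	if (accessed)
		s += 128;
	t->score[i] = (uint8_t)(s > HOT_MAX ? HOT_MAX : s);
}

/* Return the index of the coldest page (lowest score, first one on ties). */
static size_t va_hotness_coldest(const struct va_hotness *t)
{
	size_t best = 0;

	for (size_t i = 1; i < t->npages; i++)
		if (t->score[i] < t->score[best])
			best = i;
	return best;
}
```

Where the "accessed" samples come from is left open on purpose; it could
be idle page tracking today or something ACCESS bit based later.  The
table itself never refers to a PFN or folio, so host swap or migration
does not reset it.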

> uffd-wp bit is preserved across swap PTEs and migration entries, so the
> "this VA was declared cold" marker stays attached to the VA. A
> physical-side tracker loses its state the moment the folio is freed or
> replaced: a refaulted folio is a fresh object with no history.
> 
> Scaling goes the same way. Per-mm tracking of the form RWP does can
> scale with the working set. A physical-side tracker scales with all folios
> on the LRU/memcg, then needs an rmap walk per folio to map back to a
> VA — which is exactly the reason page_idle doesn't scale for this use
> case today.
> 
> There is also a cgroup-level confound: memcg hotness mixes guest memory
> with the VMM's own (worker threads, I/O buffers, vhost-user rings).
> VMA-scoped tracking is the natural unit regardless of the migration
> story.

This kind of further proves you're using shmem and you have separate
mappings.

Again, with per-mm idle page tracking these issues should all be gone.
That per-mm idle page tracking needs to:

  - Ignore rmap so it's VA based
  - Still consider secondary MMUs, hence the mmu young notifier needs to be
    present
  - Work based on ACCESS bit (to leverage hardware tracking accelerations),
    rather than relying on a kernel fault to set the access mark, which
    should be more efficient.
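
As a toy model of those three points (plain userspace C, not kernel code;
struct toy_page and both helpers are made up for illustration and do not
correspond to any existing kernel API), the scan I have in mind looks
roughly like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one virtual page's state; not a kernel structure. */
struct toy_page {
	bool pte_young;   /* models the primary MMU ACCESS bit */
	bool sec_young;   /* models a secondary MMU (e.g. EPT) access bit */
};

/* Test-and-clear both bits for one page; true if either one was set. */
static bool toy_test_and_clear_young(struct toy_page *p)
{
	bool young = p->pte_young || p->sec_young;

	p->pte_young = false;
	p->sec_young = false;
	return young;
}

/*
 * Walk npages consecutive virtual pages (no rmap, no PFN lookup) and
 * record in idle[i] whether page i was untouched since the last scan.
 * Returns the number of idle pages.
 */
static size_t toy_scan_range(struct toy_page *pages, size_t npages, bool *idle)
{
	assert(pages && idle);
	size_t nidle = 0;

	for (size_t i = 0; i < npages; i++) {
		idle[i] = !toy_test_and_clear_young(&pages[i]);
		nidle += idle[i];
	}
	return nidle;
}
```

The walk is indexed purely by position in the VA range, a page counts as
accessed if either the primary or the secondary bit was set, and nothing
has to fault for the bits to be set in the first place.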

The other thing is, could you please still answer why RWP is required for
a swap impl in general?  It's not yet mentioned in the reply.

Personally I really feel like we're looking at very similar problems.  It
is great news to me, because if you can convince me on the new api it
means our use case may likely also adopt the approach, and vice versa.

It would be great to share the new interface no matter what it is, instead
of trying to push different ones.

Thanks,

-- 
Peter Xu

