On Fri, Apr 24, 2026 at 12:37:35PM +0100, Kiryl Shutsemau wrote:
> On Thu, Apr 23, 2026 at 04:10:30PM -0400, Peter Xu wrote:
> > On Thu, Apr 23, 2026 at 09:25:30PM +0200, David Hildenbrand (Arm) wrote:
> > > >
> > > > The other thing is, as I mentioned in the other email, I still
> > > > don't know how the current RW protection would work for anonymous.
> > > > I don't yet think the user swapper can read the anon page with
> > > > RW-protected pgtables. So far my understanding is maybe you only
> > > > care about shmem so it's fine, but it'll always be great to confirm
> > > > with you.
> >
> That's true. We use vhost and therefore shmem in our setup.

I see, thanks for confirming. Side note: I believe vhost works for anon
too since GUP works for anon, but it doesn't matter as long as we know
anon isn't a must.

>
> One idea I had about how to make atomic eviction for anon is extending
> process_vm_read() and process_madvise():
>
> - Add a flag to process_vm_read() to bypass the protnone check on
>   accessible (or only RWP?) VMAs.
>
> - Allow process_madvise(MADV_DONTNEED) when the caller already has
>   ptrace write access to the target.
>
> The standing objection to remote DONTNEED has been "destructive", but
> process_vm_writev() already lets a ptrace-capable caller overwrite
> arbitrary anon with attacker-chosen content. DONTNEED is strictly
> weaker -- it zeroes, it does not inject -- so the trust model is already
> established.
>
> > > I wonder if uffdio_move could be used for a swapper implementation
> > > instead?
>
> I considered it. UFFDIO_MOVE can in principle relocate the cold folio
> into a staging VMA inside the VMM, which then reads it and drops it.
> The downside is the VMM has to maintain a second address range and
> serialise eviction through it. A purpose-built primitive -- something
> like UFFDIO_EVICT that zaps the PTE and returns the folio contents
> (optionally to an fd for io_uring) -- seems cleaner.

Right, the other thing is the unnecessary overhead of the extra pgtable
operations when moving to the staging VMA (e.g. TLB flush).

> >
> > If RW is justified to be useful first, maybe.
> >
> > I had a gut feeling Kirill's use case doesn't use anon at all, then if
> > nobody needs it we can still decide to not support anon.
> >
> > >
> > > If we ever have to read from a protnone page, maybe we could teach
> > > ptrace access to do it, or have something that can read from
> > > prot_none areas -- like uffdio_copy, which can write to prot-none
> > > areas.
> >
> > Something like swap_access() in my proposal can also partly achieve that.
> >
> > https://lore.kernel.org/all/[email protected]/
>
> A maccess()-style primitive that reads through PROT_NONE is a reasonable
> building block and overlaps with part of what UFFDIO_EVICT would need.
>
> > There, it was only about reading from swap so far, though. But that one
> > might be easier to be extended to read PROT_NONE and directly put data into
> > buffer user specified (ps: in my local tree impl I named it maccess() to
> > pair with mincore(), but it doesn't really matter; it doesn't even need to
> > be a syscall..).
> >
> > To me, the interfacing is not a major issue. The major question I have is
> > why RW protection can help in swap system impl when we already have uffd-wp.
> >
> > So I want to make sure the use case can't be implemented by uffd-wp already.
> > Because that's really what we might do for QEMU.
>
> Race-free eviction can definitely be implemented with uffd-wp already.
> But not proper working set discovery.

Good. Then we can focus the discussion on hotness tracking with RWP and
its benefits, and compare it with a pure access bit focused tracking
system (as I mentioned in the other reply).

Thanks,

-- 
Peter Xu

