On 2026-02-25 at 02:17 +1100, Gregory Price <[email protected]> wrote...
> On Tue, Feb 24, 2026 at 05:19:11PM +1100, Alistair Popple wrote:
> > On 2026-02-22 at 19:48 +1100, Gregory Price <[email protected]> wrote...
> >
> > Based on our discussion at LPC I believe one of the primary motivators here
> > was to re-use the existing mm buddy allocator rather than writing your own.
> > I remain to be convinced that alone is justification enough for doing all
> > this - DRM for example already has quite a nice standalone buddy allocator
> > (drm_buddy.c) that could presumably be used, or adapted for use, by any
> > device driver.
> >
> > The interesting part of this series (which I have skimmed but not read in
> > detail) is how device memory gets exposed to userspace - this is something
> > that existing ZONE_DEVICE implementations don't address, instead leaving it
> > up to drivers and associated userspace stacks to deal with allocation,
> > migration, etc.
> >
>
> I agree that buddy-access alone is insufficient justification - it
> started off that way - but if you want mempolicy/NUMA UAPI access,
> it turns into "re-use all of MM" - and that means using the buddy.
>
> I also expected ZONE_DEVICE vs NODE_DATA to be the primary discussion.
>
> I raised replacing it as a thought experiment, not as the proposal.
>
> The idea that drm/ is going to switch to private nodes is outside the
> realm of reality, but part of that is because of years of infrastructure
> built on the assumption that re-using mm/ is infeasible.
>
> But let's talk about DEVICE_COHERENT.
>
> ---
>
> DEVICE_COHERENT is the odd-man out among ZONE_DEVICE modes. The others
> use softleaf entries and don't allow direct mappings.
I think you have this around the wrong way - DEVICE_PRIVATE is the odd one out
as it is the one ZONE_DEVICE page type that uses softleaf entries and doesn't
allow direct mappings. Every other type of ZONE_DEVICE page allows for direct
mappings.
> (DEVICE_PRIVATE sort of does if you squint, but you can also view that
> a bit like PROT_NONE or read-only controls to force migrations).
>
> If you take DEVICE_COHERENT and:
>
> - Move pgmap out of the struct page (page_ext, NODE_DATA, etc) to free
>   the LRU list_head
> - Put pages in the buddy (free lists, watermarks, managed_pages) or add
>   pgmap->device_alloc() at every allocation callsite / buddy hook
> - Add LRU support (aging, reclaim, compaction)
> - Add isolated gating (new GFP flag and adjusted zonelist filtering)
> - Add new dev_pagemap_ops callbacks for the various mm/ features
> - Audit every folio_is_zone_device() to distinguish zone device modes
>
> ... you've built N_MEMORY_PRIVATE inside ZONE_DEVICE. Except now
> page_zone(page) returns ZONE_DEVICE - so you inherit the wrong
> defaults at every existing ZONE_DEVICE check.
>
> Skip-sites become things to opt-out of instead of opting into.
>
> You just end up with
>
>         if (folio_is_zone_device(folio))
>                 if (folio_is_my_special_zone_device())
>                 else ....
>
> and this just generalizes to
>
>         if (folio_is_private_managed(folio))
>                 folio_managed_my_hooked_operation()
I don't quite get this - couldn't you just as easily do:

        if (folio_is_zone_device(folio))
                folio_device_my_hooked_operation()

Where folio_device_my_hooked_operation() is just:

        if (pgmap->ops->my_hooked_operation)
                pgmap->ops->my_hooked_operation();
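Written out as a compilable toy, the shape I'm suggesting is just the usual
ops-table dispatch. To be clear, the struct layouts, names, and the -1
default here are all invented for illustration - none of this is the real
kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these layouts are the real kernel structs. */
struct dev_pagemap_ops {
	int (*my_hooked_operation)(void);
};

struct dev_pagemap {
	const struct dev_pagemap_ops *ops;
};

struct folio {
	struct dev_pagemap *pgmap;	/* NULL for non-device folios */
};

static int folio_is_zone_device(const struct folio *folio)
{
	return folio->pgmap != NULL;
}

/*
 * Dispatch through the per-pgmap ops table; a pgmap that doesn't
 * implement the hook just falls through to a default result.
 */
static int folio_device_my_hooked_operation(const struct folio *folio)
{
	const struct dev_pagemap *pgmap = folio->pgmap;

	if (pgmap->ops && pgmap->ops->my_hooked_operation)
		return pgmap->ops->my_hooked_operation();
	return -1;	/* default: not handled */
}

/* One driver's implementation of the hook. */
static int demo_hooked_operation(void)
{
	return 42;
}

static const struct dev_pagemap_ops demo_ops = {
	.my_hooked_operation = demo_hooked_operation,
};
```

Ie. the indirection lands in the pgmap ops table either way, and a pgmap
that doesn't implement a given hook falls back to the default behaviour.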
> So you get the same code, but have added more complexity to ZONE_DEVICE.
Don't you still have to add code to hook every operation you care about for your
private managed nodes?
> I don't think that's needed if we just recognize ZONE is the wrong
> abstraction to be operating on.
>
> Honestly, even ZONE_MOVABLE becomes pointless with N_MEMORY_PRIVATE
> if you disallow longterm pinning - because the managing service handles
> allocations (it has to inject GFP_PRIVATE to get access) or selectively
> enables the mm/ services it knows are safe (mempolicy).
>
> Even if you allow longterm pinning, if your service controls what does
> the pinning it can still be reclaimable - just manually (killing
> processes) instead of letting hotplug do it via migration.
>
> If your service only allocates movable pages - your ZONE_NORMAL is
> effectively ZONE_MOVABLE.
This is interesting - it sounds like the conclusion of this is that ZONE_* is
just a bad abstraction and should be replaced with something else - maybe
something like this? And FWIW I'm not tied to ZONE_DEVICE as being a good
abstraction, it's just what we seem to have today for determining page types.
It almost sounds like what we want is just a bunch of hooks that can be
associated with a range of pages, and then you just get rid of ZONE_DEVICE
and instead install hooks appropriate for each page a driver manages. I have
to think more about that though, this is just what popped into my head when
you started saying ZONE_MOVABLE could also disappear :-)
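For what it's worth, the half-formed idea above looks something like this toy
model when written down. The linear walk stands in for whatever sparse lookup
structure you'd actually use, and every name here is invented:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical per-range ops: instead of asking "what ZONE_* is this
 * page in", you look up the ops a driver installed for the pfn's range.
 */
struct page_range_ops {
	const char *name;
	/* free_folio, migrate_to, handle_fault, ... would live here */
};

struct page_range {
	unsigned long start_pfn, end_pfn;	/* [start, end) */
	const struct page_range_ops *ops;
};

static const struct page_range_ops dram_ops = { .name = "dram" };
static const struct page_range_ops device_ops = { .name = "device" };

static const struct page_range ranges[] = {
	{ 0x00000, 0x80000, &dram_ops },	/* ordinary memory */
	{ 0x80000, 0xa0000, &device_ops },	/* driver-managed */
};

/* A linear walk stands in for a real sparse lookup structure. */
static const struct page_range_ops *pfn_range_ops(unsigned long pfn)
{
	for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
		if (pfn >= ranges[i].start_pfn && pfn < ranges[i].end_pfn)
			return ranges[i].ops;
	return NULL;
}
```

Ranges with no ops installed would just get the default mm/ behaviour.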
> In some cases we use ZONE_MOVABLE to prevent the kernel from allocating
> memory onto devices (like CXL). This means struct page is forced to
> take up DRAM or use memmap_on_memory - meaning you lose high-value
> capacity or sacrifice contiguity (less huge page support).
One of the other reasons is to prevent long term pinning. But I think that's a
conversation that warrants a whole separate thread.
> This entire problem can evaporate if you can just use ZONE_NORMAL.
>
> There are a lot of benefits to just re-using the buddy like this.
>
> Zones are the wrong abstraction and cause more problems.
>
> > > free_folio - mirrors ZONE_DEVICE's
> > > folio_split - mirrors ZONE_DEVICE's
> > > migrate_to - ... same as ZONE_DEVICE
> > > handle_fault - mirrors the ZONE_DEVICE ...
> > > memory_failure - parallels memory_failure_dev_pagemap(),
> >
> > One does not have to squint too hard to see that the above is not so
> > different from what ZONE_DEVICE provides today via dev_pagemap_ops(). So I
> > think it would be worth outlining why the existing ZONE_DEVICE mechanism
> > can't be extended to provide these kind of services.
> >
> > This seems to add a bunch of code just to use NODE_DATA instead of
> > page->pgmap, without really explaining why just extending dev_pagemap_ops
> > wouldn't work. The obvious reason is that if you want to support things
> > like reclaim, compaction, etc. these pages need to be on the LRU, which is
> > a little bit hard when that field is also used by the pgmap pointer for
> > ZONE_DEVICE pages.
> >
>
> You don't have to squint because it was deliberate :]
Nice.
> The callback similarity is the feature - they're the same logical
> operations. The difference is the direction of the defaults.
>
> Extending ZONE_DEVICE into these areas requires the same set of hooks,
> plus distinguishing "old ZONE_DEVICE" from "new ZONE_DEVICE".
>
> Where there are new injection sites, it's because ZONE_DEVICE opts
> out of ever touching that code in some other silently implied way.
Yeah, I hate that aspect of ZONE_DEVICE. There are far too many places where we
"prove" you can't have a ZONE_DEVICE page because of ad-hoc "reasons". Usually
they take the form of "it's not on the LRU", or "it's not an anonymous page and
this isn't DAX", etc.
> For example, reclaim/compaction doesn't run because ZONE_DEVICE doesn't
> add to managed_pages (among other reasons).
And people can't even agree on the reasons. I would argue the primary reason is
reclaim/compaction doesn't run because it can't even find the pages due to them
not being on the LRU. But everyone is equally correct.
> You'd have to go figure out how to hack those things into ZONE_DEVICE
> *and then* opt every *other* ZONE_DEVICE mode *back out*.
>
> So you still end up with something like this anyway:
>
>         static inline bool folio_managed_handle_fault(struct folio *folio,
>                                                       struct vm_fault *vmf,
>                                                       enum pgtable_level level,
>                                                       vm_fault_t *ret)
>         {
>                 /* Zone device pages use swap entries; handled in do_swap_page */
>                 if (folio_is_zone_device(folio))
>                         return false;
>
>                 if (folio_is_private_node(folio))
>                         ...
>                 return false;
>         }
>
>
> > example page_ext could be used. Or I hear struct page may go away in place
> > of folios any day now, so maybe that gives us space for both :-)
> >
>
> If NUMA is the interface we want, then NODE_DATA is the right direction
> regardless of struct page's future or what zone it lives in.
>
> There's no reason to keep per-page pgmap w/ device-to-node mappings.
In reality I suspect that's already the case today. I'm not sure we need
per-page pgmap.
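Eg. as a toy model (invented names, not the real structs) - if the node
carries the pgmap then page->pgmap becomes derivable:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES 4

struct dev_pagemap {
	int owner;	/* stand-in for the driver's owner context */
};

/* Toy pglist_data: the node, not the page, carries the pgmap. */
struct pglist_data {
	struct dev_pagemap *pgmap;	/* NULL for ordinary nodes */
};

static struct pglist_data node_data[MAX_NUMNODES];

struct page {
	int nid;	/* in reality encoded in page->flags */
};

static int page_to_nid(const struct page *page)
{
	return page->nid;
}

/*
 * What used to be a per-page pgmap pointer becomes a node lookup;
 * the pfn (not modelled here) can still differentiate devices that
 * share one node.
 */
static struct dev_pagemap *page_pgmap(const struct page *page)
{
	return node_data[page_to_nid(page)].pgmap;
}

static struct dev_pagemap device_pgmap = { .owner = 1 };
```

And as you say below, a driver managing multiple devices on one node can use
the pfn to differentiate them within the shared owner context.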
> You can have one driver manage multiple devices with the same numa node
> if it uses the same owner context (PFN already differentiates devices).
>
> The existing code allows for this.
>
> > The above also looks pretty similar to the existing ZONE_DEVICE methods for
> > doing this which is another reason to argue for just building up the
> > feature set of the existing boondoggle rather than adding another
> > thingymebob.
> >
> > It seems the key thing we are looking for is:
> >
> > 1) A userspace API to allocate/manage device memory (ie. move_pages(),
> >    mbind(), etc.)
> >
> > 2) Allowing reclaim/LRU list processing of device memory.
> >
> > From my perspective both of these are interesting and I look forward to the
> > discussion (hopefully I can make it to LSFMM). Mostly I'm interested in the
> > implementation as this does on the surface seem to sprinkle around and
> > duplicate a lot of hooks similar to what ZONE_DEVICE already provides.
> >
>
> On (1): ZONE_DEVICE NUMA UAPI is harder than it looks from the surface
Ok, I will admit I've only been hovering on the surface so I need to give this
some more thought. Everything you've written below makes sense and is
definitely food for thought. Thanks.
- Alistair
> Much of the kernel mm/ infrastructure is written on top of the buddy and
> expects N_MEMORY to be the sole arbiter of "Where to Acquire Pages".
>
> Mempolicy depends on:
> - Buddy support or a new alloc hook around the buddy
>
> - Migration support (mbind() after allocation migrates)
> - Migration also deeply assumes buddy and LRU support
>
> - Changing validations on node states
> - mempolicy checks N_MEMORY membership, so you have to hack
> N_MEMORY onto ZONE_DEVICE
> (or teach it about a new node state... N_MEMORY_PRIVATE)
>
>
> Getting mempolicy to work with N_MEMORY_PRIVATE amounts to adding 2
> lines of code in vma_alloc_folio_noprof:
>
>         struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>                                              struct vm_area_struct *vma,
>                                              unsigned long addr)
>         {
>                 if (pol->flags & MPOL_F_PRIVATE)
>                         gfp |= __GFP_PRIVATE;
>
>                 folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
>                 /* Woo! I faulted a DEVICE PAGE! */
>         }
>
> But this requires the pages to be managed by the buddy.
>
> The rest of the mempolicy support is around keeping sane nodemasks when
> things like cpuset.mems rebinds occur and validating you don't end up
> with private nodes that don't support mempolicy in your nodemask.
>
> You have to do all of this anyway, but with the added bonus of fighting
> with the overloaded nature of ZONE_DEVICE at every step.
>
> ==========
>
> On (2): Assume you solve LRU.
>
> Zone Device has no free lists, managed_pages, or watermarks.
>
> kswapd can't run, compaction has no targets, vmscan's pressure model
> doesn't function. These all come for free when the pages are
> buddy-managed on a real zone. Why re-invent the wheel?
>
> ==========
>
> So you really have two options here:
>
> a) Put pages in the buddy, or
>
> b) Add pgmap->device_alloc() callbacks at every allocation site that
> could target a node:
> - vma_alloc_folio
> - alloc_migration_target
> - alloc_demote_folio
> - alloc_pages_node
> - alloc_contig_pages
> - list goes on
>
> Or more likely - hooking get_page_from_freelist. Which at that
> point... just use the buddy? You're already deep in the hot path.
>
> >
> > For basic allocation I agree this is the case. But there's no reason some
> > device allocator library couldn't be written. Or in fact as pointed out
> > above reuse the already existing one in drm_buddy.c. So would be interested
> > to hear arguments for why allocation has to be done by the mm allocator
> > and/or why an allocation library wouldn't work here given DRM already has
> > them.
> >
>
> Using the buddy underpins the rest of mm/ services we want to re-use.
>
> That's basically it. Otherwise you have to inject hooks into every
> surface that touches the buddy...
>
> ... or in the buddy (get_page_from_freelist), at which point why not
> just use the buddy?
>
> ~Gregory