On 4/15/26 21:47, Frank van der Linden wrote:
> On Wed, Apr 15, 2026 at 8:18 AM Gregory Price <[email protected]> wrote:
>>
>> On Wed, Apr 15, 2026 at 11:49:59AM +0200, David Hildenbrand (Arm) wrote:
>>
>> As a preface - the current RFC was informed by ZONE_DEVICE patterns.
>>
>> I think that was useful as a way to find existing friction points - but
>> ultimately wrong for this new interface.
>>
>> I don't think an ops struct here is the right design, and I think there
>> are only a few patterns that actually make sense for device memory using
>> nodes this way.
>>
>> So there's going to be a *major* contraction in the complexity of this
>> patch series (hopefully I'll have something next week), and much of what
>> you point out below is already in-flight.
>>
>> ... snip ...
>>
>>> A related series proposed some MEM_READ/WRITE backend requests [1]
>>>
>>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg02693.html
>>
>> Oh interesting, thank you for the reference here.
>>
>>> Something else people were discussing in the past was to physically
>>> limit the area where virtio queues could be placed.
>>
>> That is functionally what I did - the idea was pretty simple, just have
>> a separate memfd/node dedicated for the queues:
>>
>> guest_memory = memfd(MAP_PRIVATE)
>> net_memory = memfd(MAP_SHARED)
>>
>> And boom, you get what you want.
>>
>> So yeah, "it works" - but there are likely other ways to do this too, and
>> as you note re: compatibility, I'm not sure virtio actually wants this,
>> but it's a nice proof-of-concept for a network device on the host that
>> carries its own memory.
>>
>> I'll try to post my hack as an example with the next RFC version, as I
>> think it's informative.
>>
>>> But that's a different "fallback" problem, no?
>>>
>>> You want allocations that target the "special node" to fall back to
>>> *other* nodes, but not other allocations to fall back to *this special*
>>> node.
>>
>> ... snip - slight reordering to put thoughts together ...
>>
>>> Needs a second thought regarding fallback logic I raised above.
>>>
>>> What I think would have to be audited is the usage of __GFP_THISNODE by
>>> kernel allocations, where we would not actually want to allocate from
>>> this private node.
>>
>> This is fair, and a re-visit is absolutely warranted.
>>
>> Re-examining the quick audit from my last response suggests I should
>> never have seen leakage in those cases, but the fallbacks are needed.
>>
>> So yes, this all requires a second look (and a third, and a ninth).
>>
>> I'm not married to __GFP_PRIVATE, but it has been reliable for me.
>>
>>> Maybe we could just outright refuse *any* non-user (movable) allocations
>>> that target the node, even with __GFP_THISNODE.
>>>
>>> Because, why would we want kernel allocations to even end up on a
>>> private node that is supposed to only be consumed by user space? Or
>>> which use cases are there where we would want to place kernel
>>> allocations on there?
>>
>> As a start, maybe? But as a permanent invariant? I would wonder whether
>> the decision here would lock us into a design.
>>
>> But then - this is all kernel internal, so I think it would be feasible
>> to change this out from under users without backward compatibility pain.
>>
>> So far I have done my best to avoid changing any userland interfaces in
>> a way that would fundamentally change the contracts. If anything
>> private-node-specific other than the node's `has_memory_private`
>> attribute leaks into userland, someone messed up.
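As a reference for the memfd split Gregory describes above, a minimal
userspace sketch of the idea: one memfd for guest RAM, a second memfd
whose mapping is bound to a dedicated node for the queues. The node
number, sizes, and the plain mbind() call are illustrative assumptions
on this end, not the actual QEMU hack from the RFC:

/* Hypothetical sketch: back the virtio queues with their own memfd and
 * bind that mapping to a dedicated node, while guest RAM keeps the
 * default policy.  Node 2 is made up; link with -lnuma for mbind(). */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define QUEUE_NODE 2	/* hypothetical "device" node id */

int main(void)
{
	size_t guest_sz = 256UL << 20;	/* guest RAM  */
	size_t queue_sz = 2UL << 20;	/* queue area */

	int guest_fd = memfd_create("guest-ram", 0);
	int queue_fd = memfd_create("virtio-queues", 0);
	if (guest_fd < 0 || queue_fd < 0 ||
	    ftruncate(guest_fd, guest_sz) || ftruncate(queue_fd, queue_sz)) {
		perror("memfd setup");
		return 1;
	}

	void *guest = mmap(NULL, guest_sz, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE, guest_fd, 0);
	void *queues = mmap(NULL, queue_sz, PROT_READ | PROT_WRITE,
			    MAP_SHARED, queue_fd, 0);
	if (guest == MAP_FAILED || queues == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Only the queue mapping is pinned to the dedicated node; guest
	 * RAM keeps the default policy and falls back normally. */
	unsigned long nodemask = 1UL << QUEUE_NODE;
	if (mbind(queues, queue_sz, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0))
		perror("mbind");

	memset(queues, 0, 4096);	/* touch so the binding takes effect */
	return 0;
}

The point is just that the binding rides on the queue mapping alone;
everything else keeps normal fallback behavior.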
>>
>> So... I think that's reasonable.
>>
>>> I assume you will be at LSF/MM? Would be good to discuss some of that
>>> in person.
>>
>> Yes, looking forward to it :]
>>
>>> Again, I am not sure about compaction and khugepaged. All we want to
>>> guarantee is that our memory does not leave the private node.
>>>
>>> That doesn't require any __GFP_PRIVATE magic, just enlightening these
>>> subsystems that private nodes must use __GFP_THISNODE and must not leak
>>> to other nodes.
>>
>> This is where specific use-cases matter.
>>
>> In the compressed memory example, the device doesn't care about memory
>> leaving - but it cares about memory arriving *and being modified*.
>> (more on this in your next question)
>>
>> So I'm not convinced *all possible devices* would always want to support
>> move_pages(), mbind(), and set_mempolicy().
>>
>> But I do want to give this serious thought, and I agree the absolute
>> minimal patch set could just be the fallback control mechanism and the
>> mm/ component filters/audit on __GFP_*.
>>
>>> I'm missing why these are even opt-in. What's the problem with allowing
>>> mbind and mempolicy to use these nodes in some of your drivers?
>>
>> First:
>>
>> In my latest working branch these two flags have been folded into just
>> _OPS_MEMPOLICY, and any other migration interaction is just handled by
>> filtering on the GFP flag.
>>
>> On always allowing mbind and mempolicy vs. opt-in
>> ---
>>
>> A proper compressed memory solution should not allow mbind/mempolicy.
>>
>> Compressed memory is different from normal memory: the kernel can
>> perceive free memory (many unused struct pages in the buddy) while the
>> device knows there is none left (the physical capacity is actually full).
>>
>> Any form of write to a compressed memory device is essentially a
>> dangerous condition (OOMs = poison, not oom_kill()).
>>
>> So you need two controls - allocation and (userland) write protection -
>> which I implemented via:
>> - Demotion-only (allocations only happen in the reclaim path)
>> - Write-protecting the entire node
>>
>> (I fully accept that a write-protection extension here might be a bridge
>> too far, but please stick with me for the sake of exploration.)
>>
>> There's a serious argument for limiting these devices to an mbind
>> pattern, but I wanted to make a full-on attempt to integrate this device
>> into the demotion path as a transparent tier (kinda like zswap).
>>
>> I could not square write-protection with mempolicy, so I had to make
>> them both optional and mutually exclusive.
>>
>> If you limit the device to mbind interactions, you do limit what can
>> crash - but this forces userland software to be less portable by design:
>>
>> - am I running on a system where this device is present?
>> - is that device exposing its memory on a node?
>> - which node?
>> - what memory can I put on that node? (can you prevent a process from
>>   putting libc on that node?)
>> - how much compression ratio is left on the device?
>> - can I safely write to this virtual address?
>> - should I write-protect compressed VMAs? Can I handle those faults?
>> - many more
>>
>> That sounds a lot like re-implementing a bunch of mm/ in userland, and
>> that's exactly where we were with DAX. We know this pattern failed.
>>
>> I'm trying hard to avoid repeating those mistakes, so I'm looking for a
>> path forward here that results in transparent usage of this memory.
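To make that portability burden concrete, here is roughly the dance an
mbind-only model pushes into every application: discover the node by
hand, then bind to it, before any of the capacity or write-safety
questions are even reachable. The sysfs path and the per-node
has_memory_private file are assumptions for illustration (only the
attribute name comes from the discussion above), not a proposed ABI:

/* Hypothetical discovery an application would need under an mbind-only
 * model: scan sysfs for a node advertising has_memory_private, then
 * MPOL_BIND a region to it.  Link with -lnuma for mbind(). */
#define _GNU_SOURCE
#include <dirent.h>
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

static int find_private_node(void)
{
	DIR *dir = opendir("/sys/devices/system/node");
	struct dirent *de;
	int found = -1;

	if (!dir)
		return -1;
	while (found < 0 && (de = readdir(dir))) {
		int nid;
		char path[128], val = '0';

		if (sscanf(de->d_name, "node%d", &nid) != 1)
			continue;
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/has_memory_private",
			 nid);
		FILE *f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, " %c", &val) == 1 && val == '1')
			found = nid;
		fclose(f);
	}
	closedir(dir);
	return found;
}

int main(void)
{
	int nid = find_private_node();
	size_t len = 64UL << 20;

	if (nid < 0) {
		fprintf(stderr, "no private node exposed\n");
		return 1;
	}

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	unsigned long nodemask = 1UL << nid;
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");

	/* ...and the app still has to answer the remaining questions:
	 * how much compressed capacity is really left, is it safe to
	 * write here, what about shared libraries, and so on. */
	return 0;
}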
>>> I also have some questions about longterm pinnings, but that's better
>>> discussed in person :)
>>
>> The longterm pin extension came from auditing existing zone_device
>> filters.
>>
>> tl;dr: informative mechanism - but it probably should be dropped,
>> it makes no sense (it's device memory, pinnings mean nothing?).
>>
>>> Right, that's rather invasive.
>>
>> Yeah, I'm trying to avoid it, and the answer may actually just live in
>> the task-death and VMA cleanup path rather than the folio-free path.
>>
>> From what I've seen of accelerator drivers that implement this, when you
>> inform the driver of a memory region used by a task, the driver should
>> have a mechanism to take references on that VMA (or something like
>> this), so that when the task dies the driver is notified of the VMA
>> being cleaned up.
>>
>> This probably exists - I just haven't gotten there yet.
>>
>> ~Gregory
>
> This has been a really great discussion. I just wanted to add a few
> points that I think I have mentioned in other forums, but not here.
>
> In essence, this is a discussion about memory properties and the level
> at which they should be dealt with. Right now there are basically 3
> levels: pageblocks, zones and nodes. While these levels exist for good
> reasons, they also sometimes lead to issues. There's duplication of
> functionality: MIGRATE_CMA and ZONE_MOVABLE both implement the same
> basic property, but at different levels (attempts have been made to
> merge them, but it didn't work out). There's also memory with clashing
> properties inhabiting the same data structure: LRUs. Having strictly
> movable memory on the same LRU as unmovable memory is a mismatch. It
> leads to the well-known problem that reclaim done on behalf of an
> unmovable allocation attempt can be entirely pointless in the face of
> large amounts of ZONE_MOVABLE or MIGRATE_CMA memory: the anon LRU will
> be chock full of movable-only pages. Reclaiming them is useless for
> your allocation, and skipping them leads to locking up the system
> because you're holding on to the LRU lock for a long time.
>
> So, looking at having some properties set at the node level makes
> sense to me even in the non-device case. But perhaps that is out of
> scope for the initial discussion.
>
> One use case that seems like a good match for private nodes is guest
> memory. Guest memory is special enough to want to allocate / maintain
> it separately, which is acknowledged by the introduction of
> guest_memfd.
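A brief aside on the VMA-cleanup question from Gregory further up: the
notification mechanism that "probably exists" sounds a lot like the
existing mmu_notifier hooks, where the .release callback fires when the
address space is torn down. A rough, untested sketch of that pattern,
with the driver context structure made up for illustration - whether it
is actually the right hook for private-node memory is an open question:

/* Rough, untested sketch: a driver that was handed a user mapping
 * registers an mmu_notifier on the task's mm; .release runs when the
 * mm goes away (task death / exec), letting the driver drop whatever
 * references it took on the region. */
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct my_dev_ctx {
	struct mmu_notifier mn;
	/* ... driver bookkeeping for the region it was handed ... */
};

static void my_dev_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_dev_ctx *ctx = container_of(mn, struct my_dev_ctx, mn);

	/* The address space is going away: tear down device-side state
	 * that referenced the user mapping. */
	(void)ctx;
}

static const struct mmu_notifier_ops my_dev_mn_ops = {
	.release = my_dev_mm_release,
};

static int my_dev_track_current_mm(struct my_dev_ctx *ctx)
{
	ctx->mn.ops = &my_dev_mn_ops;
	/* mmu_notifier_register() takes mmap_write_lock() internally. */
	return mmu_notifier_register(&ctx->mn, current->mm);
}

Anyway, back to Frank's guest-memory point: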
Yes. There is now an interface to configure mbind() for guest_memfd. So
with that and some tweaks, maybe that ... would just work, if we get the
mbind() interaction right?

--
Cheers,

David
