Here’s an update from the Linux MM Alignment Session on July 10 2024, 9-10am PDT:

The current direction is:

+ Allow mmap() of ranges that cover both shared and private memory, but disallow
  faulting in of private pages
  + On access to private pages, userspace will get some error, perhaps SIGBUS
  + On shared to private conversions, unmap the page and decrease refcounts

+ To support huge pages, guest_memfd will take ownership of the hugepages, and
  provide interested parties (userspace, KVM, iommu) with pages to be used.
  + guest_memfd will track usage of (sub)pages, for both private and shared
    memory
  + Pages will be broken into smaller (probably 4K) chunks at creation time to
    simplify implementation (as opposed to splitting at runtime when private to
    shared conversion is requested by the guest)
    + Core MM infrastructure will still be used to track page table mappings in
      mapcounts and other references (refcounts) per subpage
    + HugeTLB vmemmap Optimization (HVO) is lost when pages are broken up - to
      be optimized later. Suggestions:
      + Use a tracking data structure other than struct page
      + Remove the memory for struct pages backing private memory from the
        vmemmap, and re-populate the vmemmap on conversion from private to
        shared
  + Implementation pointers for huge page support
    + Consensus was that getting core MM to do tracking seems wrong
    + Maintaining special page refcounts for guest_memfd pages is difficult to
      get working and requires weird special casing in many places. This was
      tried for FS DAX pages and did not work out: [1]
    + Implementation suggestion: use infrastructure similar to what ZONE_DEVICE
      uses, to provide the huge page to interested parties
      + TBD: how to actually get huge pages into guest_memfd
      + TBD: how to provide/convert the huge pages to ZONE_DEVICE
        + Perhaps reserve them at boot time like in HugeTLB

+ Line of sight to compaction/migration:
  + Compaction here means making memory contiguous
  + Compaction/migration scope:
    + In scope for 4K pages
    + Out of scope for 1G pages and anything managed through ZONE_DEVICE
    + Out of scope for an initial implementation
  + Ideas for future implementations
    + Reuse the non-LRU page migration framework as used by memory ballooning
    + Have userspace drive compaction/migration via ioctls
    + Having line of sight to optimizing lost HVO means avoiding being locked in
      to any implementation requiring struct pages
      + Without struct pages, it is hard to reuse core MM’s
        compaction/migration infrastructure

+ Discuss more details at LPC in Sep 2024, such as how to use huge pages,
  shared/private conversion, huge page splitting

This addresses the prerequisites set out by Fuad and Elliott at the beginning of
the session, which were:

1. Non-destructive shared/private conversion
   + Through having guest_memfd manage and track both shared/private memory
2. Huge page support with the option of converting individual subpages
   + Splitting of pages will be managed by guest_memfd
3. Line of sight to compaction/migration of private memory
   + Possibly driven by userspace using guest_memfd ioctls
4. Loading binaries into guest (private) memory before VM starts
   + This was identified as a special case of (1.) above
5. Non-protected guests in pKVM
   + Not discussed during session, but this is a goal of guest_memfd, for all VM
     types [2]

David Hildenbrand summarized this during the meeting at t=47m25s [3].
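
To make the per-subpage tracking idea above a bit more concrete, here is a
rough userspace sketch in C. It is not kernel code and not the proposed
implementation: every hypo_* name is hypothetical, the 2M huge page size and
the "all subpages start private" default are assumptions made only for
illustration, and a -EFAULT return value stands in for whatever error (perhaps
SIGBUS) userspace would actually see.

```c
/* Userspace sketch only; none of these names exist in the kernel. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define HYPO_HUGE_SIZE   (2UL << 20)   /* assumption: a 2M huge page */
#define HYPO_SUB_SIZE    (4UL << 10)   /* broken into 4K chunks */
#define HYPO_NR_SUBPAGES (HYPO_HUGE_SIZE / HYPO_SUB_SIZE)

enum hypo_state { HYPO_SHARED, HYPO_PRIVATE };

/* Per-subpage tracking record (a stand-in for whatever structure is chosen). */
struct hypo_subpage {
	enum hypo_state state;
	unsigned int mapcount;   /* userspace mappings of this subpage */
};

struct hypo_hugepage {
	struct hypo_subpage sub[HYPO_NR_SUBPAGES];
};

/* Split at creation time: each subpage is tracked individually from the start. */
static void hypo_create(struct hypo_hugepage *hp)
{
	memset(hp, 0, sizeof(*hp));
	for (size_t i = 0; i < HYPO_NR_SUBPAGES; i++)
		hp->sub[i].state = HYPO_PRIVATE;   /* assumption: start private */
}

/* Userspace fault: allowed on shared subpages, refused on private ones. */
static int hypo_fault(struct hypo_hugepage *hp, size_t offset)
{
	size_t idx = offset / HYPO_SUB_SIZE;

	assert(idx < HYPO_NR_SUBPAGES);
	if (hp->sub[idx].state == HYPO_PRIVATE)
		return -EFAULT;   /* stand-in for SIGBUS */
	hp->sub[idx].mapcount++;
	return 0;
}

/* Convert one subpage; shared -> private drops its userspace mappings. */
static void hypo_convert(struct hypo_hugepage *hp, size_t offset,
			 enum hypo_state to)
{
	size_t idx = offset / HYPO_SUB_SIZE;

	assert(idx < HYPO_NR_SUBPAGES);
	if (to == HYPO_PRIVATE)
		hp->sub[idx].mapcount = 0;   /* "unmap" on shared -> private */
	hp->sub[idx].state = to;
}
```

The conversion only flips tracking state and never touches page contents,
which is the sense in which it is meant to be non-destructive. The sketch says
nothing about how refcounts, KVM, or the IOMMU would be handled.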

[1]: https://lore.kernel.org/linux-mm/cover.66009f59a7fe77320d413011386c3ae5c2ee82eb.1719386613.git-series.apop...@nvidia.com/
[2]: https://lore.kernel.org/lkml/znrmn1obu8tfr...@google.com/
[3]: https://drive.google.com/file/d/17lruFrde2XWs6B1jaTrAy9gjv08FnJ45/view?t=47m25s&resourcekey=0-LiteoxLd5f4fKoPRMjMTOw