This patch series introduces support for Transparent Huge Page (THP)
migration in zone device-private memory. The implementation enables
efficient migration of large folios between system memory and
device-private memory.

Background

The current zone device-private memory implementation only supports
PAGE_SIZE granularity, leading to:
- Increased TLB pressure
- Inefficient migration between CPU and device memory

This series extends the existing zone device-private infrastructure to
support THP, leading to:
- Reduced page table overhead
- Improved memory bandwidth utilization
- Seamless fallback to base pages when needed

In my local testing (using lib/test_hmm) and a throughput test, the
series shows a 350% improvement in data transfer throughput and an
80% improvement in latency.

These patches build on the earlier posts by Ralph Campbell [1].

Two new flags are added to the migrate_vma interface to select and mark
compound pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as an argument.

The series also adds zone device awareness to (m)THP pages along with
fault handling of large zone device private pages. The page vma walk
and the rmap code are also zone device aware. Support has also been
added for folios that might need to be split in the middle of
migration (when the src and dst do not agree on MIGRATE_PFN_COMPOUND).
This occurs when the src side of the migration can migrate large
pages, but the destination has not been able to allocate large pages.
The code supports and uses folio_split() when migrating THP pages; this
path is used when MIGRATE_VMA_SELECT_COMPOUND is not passed as an
argument to migrate_vma_setup().

The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and to test the folio split path. A new
throughput test has been added as well.

The nouveau dmem code has been enhanced to use the new THP migration
capability.

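The split-on-disagreement case described above can be modelled in a few
lines. The following is only a userspace sketch, not code from the
series: the SKETCH_* names, the bit values and sketch_decide() are all
hypothetical stand-ins, and the real logic lives in mm/migrate_device.c
and operates on the src/dst arrays of struct migrate_vma. The source
large/destination base case follows the description above; how the
reverse case is handled is an assumption of the sketch.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-ins for per-entry migration flags. The bit values
 * are illustrative only and are NOT the kernel's MIGRATE_PFN_* values.
 */
#define SKETCH_PFN_VALID    (1UL << 0)
#define SKETCH_PFN_COMPOUND (1UL << 1)

enum sketch_action {
	SKETCH_SKIP,               /* entry cannot be migrated */
	SKETCH_MIGRATE_BASE,       /* both sides use base pages */
	SKETCH_MIGRATE_LARGE,      /* both sides agree on a large folio */
	SKETCH_SPLIT_THEN_MIGRATE, /* src is large, dst is not: split src */
};

/* Decide what to do with one src/dst entry pair. */
static enum sketch_action sketch_decide(unsigned long src, unsigned long dst)
{
	bool src_large = src & SKETCH_PFN_COMPOUND;
	bool dst_large = dst & SKETCH_PFN_COMPOUND;

	if (!(src & SKETCH_PFN_VALID) || !(dst & SKETCH_PFN_VALID))
		return SKETCH_SKIP;
	if (src_large && dst_large)
		return SKETCH_MIGRATE_LARGE;
	/* Destination could not allocate a large page: fall back. */
	if (src_large)
		return SKETCH_SPLIT_THEN_MIGRATE;
	/* Assumption: a large dst for a base-page src is not migrated. */
	if (dst_large)
		return SKETCH_SKIP;
	return SKETCH_MIGRATE_BASE;
}
```

In this model a driver that fails a large allocation simply leaves the
compound bit clear in its destination entry, and the core falls back to
splitting the source folio and migrating base pages.
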
mTHP support:

The patches hard code HPAGE_PMD_NR in a few places, but the code has
been kept generic to support various order sizes. With additional
refactoring of the code, support for different order sizes should be
possible.

The future plan is to post enhancements to support mTHP with a rough
design as follows:

1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if
   a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for
   migration
4. Migrate and finalize

The mTHP patches can be built on top of this series. The key design
elements that need to be worked out are infrastructure and driver
support for multiple ordered pages and their migration.

HMM support for large folios: patches are already posted and are in
mm-unstable.

Cc: Andrew Morton <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Joshua Hahn <[email protected]>
Cc: Rakie Kim <[email protected]>
Cc: Byungchul Park <[email protected]>
Cc: Gregory Price <[email protected]>
Cc: Ying Huang <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: "Liam R. Howlett" <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Lyude Paul <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Simona Vetter <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Mika Penttilä <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Francois Dugast <[email protected]>

References:
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://lkml.kernel.org/r/[email protected]
[5] https://lore.kernel.org/lkml/[email protected]/
[6] https://lore.kernel.org/lkml/[email protected]/
[7] https://lore.kernel.org/lkml/[email protected]/
[8] https://lore.kernel.org/all/[email protected]/
[9] https://lore.kernel.org/lkml/[email protected]/

These patches are built on top of mm/mm-new.

Changelog v7 [9]:
- Rebased against mm/mm-new again
- Addressed more review comments from Zi Yan and David Hildenbrand
  - Code flow reorganization of split_huge_pmd_locked
  - page_free callback is now changed to folio_free (posted as patch 2
    in the series)
  - zone_device_page_init() takes an order parameter
  - migrate_vma_split_pages() is now called
    migrate_vma_split_unmapped_folio()
  - More cleanups and fixes
  - Patch 6 partial unmapped folio case has been split into two parts;
    some of the content has been moved to the actual device private
    split handling code
  - Fault handling for device-private pages now uses folio routines
    instead of page_get/trylock/put routines

Changelog v6 [8]:
- Rebased against mm/mm-new after fixing the following
  - Two issues reported by kernel test robot
    - m68k requires an lvalue for pmd_present()
    - BUILD_BUG_ON() issues when THP is disabled
  - kernel doc warnings reported on linux-next
    - Thanks Stephen Rothwell!
  - smatch fixes and issues reported
    - Fix issue with potential NULL page
    - Report about young being uninitialized for device-private pages
      in __split_huge_pmd_locked()
- Several review comments from David
  - Indentation changes and style improvements
  - Removal of some unwanted extra lines
  - Introduction of new helper function
    is_pmd_non_present_folio_entry() to represent migration and device
    private pmd's
  - Code flow refactoring into migration and device private paths
  - More consistent use of helper function is_pmd_device_private()
- Review comments from Mika
  - folio_get() is not required for huge_pmd prior to split

Changelog v5 [7]:
- Rebased against mm/mm-new (resolved conflict caused by
  MIGRATEPAGE_SUCCESS removal)
- Fixed a kernel-doc warning reported by kernel test robot

Changelog v4 [6]:
- Addressed review comments
  - Split patch 2 into a smaller set of patches
  - PVMW_THP_DEVICE_PRIVATE flag is no longer present
  - damon/page_idle and other page_vma_mapped_walk paths are aware of
    device-private folios
  - No more flush for non-present entries in set_pmd_migration_entry
  - Implemented a helper function for migrate_vma_split_folio() which
    splits large folios if seen during a pte walk
  - Removed the controversial change for folio_ref_freeze using
    folio_expected_ref_count()
  - Removed functions invoked from within VM_WARN_ON
- New test cases and fixes from Matthew Brost
- Fixed bugs reported by kernel test robot (Thanks!)
- Several fixes for THP support in nouveau driver

Changelog v3 [5]:
- Addressed review comments
  - No more split_device_private_folio() helper
  - Device private large folios do not end up on deferred scan lists
  - Removed THP size order checks when initializing zone device folio
  - Fixed bugs reported by kernel test robot (Thanks!)

Changelog v2 [3]:
- Several review comments from David Hildenbrand were addressed; Mika,
  Zi and Matthew also provided helpful review comments
  - In paths where it makes sense a new helper
    is_pmd_device_private_entry() is used
  - anon_exclusive handling of zone device private pages in
    split_huge_pmd_locked() has been fixed
  - Patches that introduced helpers have been folded into where they
    are used
- Zone device handling in mm/huge_memory.c has benefited from the code
  and testing of Matthew Brost, who helped find bugs related to
  copy_huge_pmd() and partial unmapping of folios
- Zone device THP PMD support via page_vma_mapped_walk() is restricted
  to try_to_migrate_one()
- There is a new dedicated helper to split large zone device folios

Changelog v1 [2]:
- Support for handling fault_folio and using trylock in the fault path
- A new test case has been added to measure the throughput improvement
- General refactoring of code to keep up with the changes in mm
- New split folio callback when the entire split is complete/done. The
  callback is used to know when the head order needs to be reset.

Testing:
- Testing was done with ZONE_DEVICE private pages on an x86 VM

Balbir Singh (15):
  mm/zone_device: support large zone device private folios
  mm/zone_device: Rename page_free callback to folio_free
  mm/huge_memory: add device-private THP support to PMD operations
  mm/rmap: extend rmap and migration support device-private entries
  mm/huge_memory: implement device-private THP splitting
  mm/migrate_device: handle partially mapped folios during collection
  mm/migrate_device: implement THP migration of zone device pages
  mm/memory/fault: add THP fault handling for zone device private pages
  lib/test_hmm: add zone device private THP test infrastructure
  mm/memremap: add driver callback support for folio splitting
  mm/migrate_device: add THP splitting during migration
  lib/test_hmm: add large page allocation failure testing
  selftests/mm/hmm-tests: new tests for zone device THP migration
  selftests/mm/hmm-tests: new throughput tests including THP
  gpu/drm/nouveau: enable THP support for GPU memory migration

Matthew Brost (1):
  selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests

 Documentation/mm/memory-model.rst        |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c       |   7 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   7 +-
 drivers/gpu/drm/drm_pagemap.c            |  12 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 308 ++++++--
 drivers/gpu/drm/nouveau/nouveau_svm.c    |   6 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h    |   3 +-
 drivers/pci/p2pdma.c                     |   5 +-
 include/linux/huge_mm.h                  |  18 +-
 include/linux/memremap.h                 |  57 +-
 include/linux/migrate.h                  |   2 +
 include/linux/swapops.h                  |  32 +
 lib/test_hmm.c                           | 448 +++++++++--
 lib/test_hmm_uapi.h                      |   3 +
 mm/damon/ops-common.c                    |  20 +-
 mm/huge_memory.c                         | 243 ++++--
 mm/memory.c                              |   5 +-
 mm/memremap.c                            |  40 +-
 mm/migrate.c                             |   1 +
 mm/migrate_device.c                      | 609 +++++++++++++--
 mm/page_idle.c                           |   7 +-
 mm/page_vma_mapped.c                     |   7 +
 mm/pgtable-generic.c                     |   2 +-
 mm/rmap.c                                |  30 +-
 tools/testing/selftests/mm/hmm-tests.c   | 919 +++++++++++++++++++++--
 25 files changed, 2399 insertions(+), 394 deletions(-)

--
2.51.0
