On 2/9/26 07:14, Honglei Huang wrote:
>
> I've reworked the implementation in v4. The fix is actually inspired
> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>
> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
> multiple user virtual address ranges under a single mmu_interval_notifier,
> and these ranges can be non-contiguous, which is essentially the same
> problem that batch userptr needs to solve: one BO backed by multiple
> non-contiguous CPU VA ranges sharing one notifier.
That still doesn't solve the sequencing problem.

As far as I can see you can't use hmm_range_fault() with this approach,
or it would just not be very valuable.

So how should that work with your patch set?

Regards,
Christian.

>
> The wide notifier is created in drm_gpusvm_notifier_alloc():
>
>     notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>     notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>
> The Xe driver passes xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init()
> as the notifier_size, so one notifier can cover many MB of VA space
> containing multiple non-contiguous ranges.
>
> And DRM GPU SVM solves the per-range validity problem with flag-based
> validation instead of seq-based validation:
>
> - drm_gpusvm_pages_valid() checks flags.has_dma_mapping, not
>   notifier_seq. The comment explicitly states:
>   "This is akin to a notifier seqno check in the HMM documentation
>   but due to wider notifiers (i.e., notifiers which span multiple
>   ranges) this function is required for finer grained checking"
> - __drm_gpusvm_unmap_pages() clears flags.has_dma_mapping = false
>   under notifier_lock
> - drm_gpusvm_get_pages() sets flags.has_dma_mapping = true
>   under notifier_lock
>
> I adopted the same approach.
>
> DRM GPU SVM:
>   drm_gpusvm_notifier_invalidate()
>     down_write(&gpusvm->notifier_lock);
>     mmu_interval_set_seq(mni, cur_seq);
>     gpusvm->ops->invalidate()
>       -> xe_svm_invalidate()
>            drm_gpusvm_for_each_range()
>              -> __drm_gpusvm_unmap_pages()
>                   WRITE_ONCE(flags.has_dma_mapping, false);  // clear flag
>     up_write(&gpusvm->notifier_lock);
>
> KFD batch userptr:
>   amdgpu_amdkfd_evict_userptr_batch()
>     mutex_lock(&process_info->notifier_lock);
>     mmu_interval_set_seq(mni, cur_seq);
>     discard_invalid_ranges()
>       interval_tree_iter_first/next()
>         range_info->valid = false;  // clear flag
>     mutex_unlock(&process_info->notifier_lock);
>
> Both implementations:
> - Acquire notifier_lock FIRST, before any flag changes
> - Call mmu_interval_set_seq() under the lock
> - Use the interval tree to find affected ranges within the wide notifier
> - Mark the per-range flag as invalid/valid under the lock
>
> The page fault path and final validation path also follow the same
> pattern as DRM GPU SVM: fault outside the lock, set/check the per-range
> flag under the lock.
>
> Regards,
> Honglei
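[The invalidation pattern Honglei describes reduces to roughly the sketch
below. This is a minimal illustration, not the actual patch code: the
batch_userptr and batch_range types, their field names, and where the
lock lives are all assumptions made for the example; the real series
hangs this state off kgd_mem and amdkfd_process_info.]

    #include <linux/interval_tree.h>
    #include <linux/mmu_notifier.h>
    #include <linux/mutex.h>

    /* Illustrative types only; names and layout are assumed. */
    struct batch_range {
            struct interval_tree_node it_node; /* [start, last] of one CPU VA range */
            bool valid;                        /* the per-range flag, not a seqno */
    };

    struct batch_userptr {
            struct mmu_interval_notifier notifier; /* one wide notifier for the span */
            struct rb_root_cached range_tree;      /* the non-contiguous ranges */
            struct mutex notifier_lock;
    };

    static bool batch_userptr_invalidate(struct mmu_interval_notifier *mni,
                                         const struct mmu_notifier_range *range,
                                         unsigned long cur_seq)
    {
            struct batch_userptr *bu =
                    container_of(mni, struct batch_userptr, notifier);
            struct interval_tree_node *node;

            /* Lock first, then bump the seq, then clear flags -- the
             * ordering both flows above share. */
            mutex_lock(&bu->notifier_lock);
            mmu_interval_set_seq(mni, cur_seq);

            /* Only ranges overlapping the invalidated span are touched;
             * unrelated ranges under the same wide notifier are skipped. */
            for (node = interval_tree_iter_first(&bu->range_tree, range->start,
                                                 range->end - 1);
                 node;
                 node = interval_tree_iter_next(node, range->start,
                                                range->end - 1)) {
                    struct batch_range *br =
                            container_of(node, struct batch_range, it_node);

                    br->valid = false; /* cleared under notifier_lock */
            }

            mutex_unlock(&bu->notifier_lock);
            return true;
    }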
>
> On 2026/2/6 21:56, Christian König wrote:
>> On 2/6/26 07:25, Honglei Huang wrote:
>>> From: Honglei Huang <[email protected]>
>>>
>>> Hi all,
>>>
>>> This is v3 of the patch series to support allocating multiple
>>> non-contiguous CPU virtual address ranges that map to a single
>>> contiguous GPU virtual address.
>>>
>>> v3:
>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>    - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>
>> That is most likely not the best approach, but Felix or Philip need to
>> comment here since I don't know such IOCTLs well either.
>>
>>>    - When the flag is set, the mmap_offset field points to the range array
>>>    - Minimal API surface change
>>
>> Why a range of VA space for each entry?
>>
>>> 2. Improved MMU notifier handling:
>>>    - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>    - Interval tree for efficient lookup of affected ranges during
>>>      invalidation
>>>    - Avoids the per-range notifier overhead mentioned in the v2 review
>>
>> That won't work unless you also modify hmm_range_fault() to take
>> multiple VA addresses (or ranges) at the same time.
>>
>> The problem is that we must rely on hmm_range.notifier_seq to detect
>> changes to the page tables in question, but that in turn works only if
>> you have one hmm_range structure and not multiple.
>>
>> What might work is doing an XOR or CRC over all the
>> hmm_range.notifier_seq values you have, but that is a bit flaky.
>>
>> Regards,
>> Christian.
>>
>>>
>>> 3. Better code organization: Split into 8 focused patches for easier
>>>    review
>>>
>>> v2:
>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>> - All ranges validated together and mapped to contiguous GPU VA
>>> - Single kgd_mem object with an array of user_range_info structures
>>> - Unified eviction/restore path for all ranges in a batch
>>>
>>> Current Implementation Approach
>>> ===============================
>>>
>>> This series implements a practical solution within existing kernel
>>> constraints:
>>>
>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>    entire range from the lowest to the highest address in the batch
>>>
>>> 2. Interval tree filtering: Use an interval tree to efficiently identify
>>>    which specific ranges are affected during invalidation callbacks,
>>>    avoiding unnecessary processing for unrelated address changes
>>>
>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>    restore paths, maintaining consistency with existing userptr handling
>>>
>>> Patch Series Overview
>>> =====================
>>>
>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>
>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>> - user_range_info structure for per-range tracking
>>> - Fields for batch allocation in kgd_mem
>>>
>>> Patch 3/8: Implement interval tree for userptr ranges
>>> - Interval tree for efficient range lookup during invalidation
>>> - mark_invalid_ranges() function
>>>
>>> Patch 4/8: Add batch MMU notifier support
>>> - Single notifier for the entire VA span
>>> - Invalidation callback using interval tree filtering
>>>
>>> Patch 5/8: Implement batch userptr page management
>>> - get_user_pages_batch() and set_user_pages_batch()
>>> - Per-range page array management
>>>
>>> Patch 6/8: Add batch allocation function and export API
>>> - init_user_pages_batch() main initialization
>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>
>>> Patch 7/8: Unify userptr cleanup and update paths
>>> - Shared eviction/restore handling for batch allocations
>>> - Integration with existing userptr validation flows
>>>
>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>> - Input validation and range array parsing
>>> - Integration with existing alloc_memory_of_gpu path
>>>
>>> Testing
>>> =======
>>>
>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>> - Various allocation sizes (4KB to 1GB+ per range)
>>> - Memory pressure scenarios and eviction/restore cycles
>>> - OpenCL CTS and HIP catch tests in a KVM guest environment
>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>> - Small LLM inference (3B-7B models)
>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>> - Performance improvement: 2x-2.4x faster than the userspace approach
>>>
>>> Thank you for your review and feedback.
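[To make the sequencing question concrete: under one shared wide
notifier, per-range validation would have to give each CPU VA range its
own hmm_range and its own notifier_seq snapshot, then combine the seq
check with the per-range flag under the lock. A minimal sketch, reusing
the hypothetical batch_userptr/batch_range types from the earlier
sketch; validate_batch_ranges() and the caller-provided seqs/pfns arrays
are assumptions, and the -EBUSY retry handling of hmm_range_fault() is
elided.]

    #include <linux/hmm.h>
    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    static int validate_batch_ranges(struct batch_userptr *bu,
                                     struct mm_struct *mm,
                                     struct batch_range *ranges,
                                     unsigned long *seqs,
                                     unsigned long **pfns, int n)
    {
            int i, ret;

    retry:
            /* Fault each range outside the lock, one hmm_range apiece. */
            for (i = 0; i < n; i++) {
                    struct hmm_range hrange = {
                            .notifier = &bu->notifier, /* shared wide notifier */
                            .start = ranges[i].it_node.start,
                            .end = ranges[i].it_node.last + 1,
                            .hmm_pfns = pfns[i],
                            .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
                    };

                    if (READ_ONCE(ranges[i].valid))
                            continue; /* pages survived; the flag says so */

                    hrange.notifier_seq = mmu_interval_read_begin(&bu->notifier);
                    mmap_read_lock(mm);
                    ret = hmm_range_fault(&hrange); /* -EBUSY retry elided */
                    mmap_read_unlock(mm);
                    if (ret)
                            return ret;
                    seqs[i] = hrange.notifier_seq;
            }

            mutex_lock(&bu->notifier_lock);
            for (i = 0; i < n; i++) {
                    if (ranges[i].valid)
                            continue; /* guarded by the flag, not the seq */
                    /* An invalidation anywhere in the wide span bumps the
                     * shared seq, so a plain seq check can force unrelated
                     * ranges to refault -- Christian's objection. Checking
                     * the seq only for ranges just faulted, and trusting the
                     * flag for the rest, is the GPU-SVM-style compromise. */
                    if (mmu_interval_read_retry(&bu->notifier, seqs[i])) {
                            mutex_unlock(&bu->notifier_lock);
                            goto retry;
                    }
                    ranges[i].valid = true; /* set under notifier_lock */
            }
            mutex_unlock(&bu->notifier_lock);
            return 0;
    }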
>>>
>>> Best regards,
>>> Honglei Huang
>>>
>>> Honglei Huang (8):
>>>   drm/amdkfd: Add userptr batch allocation UAPI structures
>>>   drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>   drm/amdkfd: Implement interval tree for userptr ranges
>>>   drm/amdkfd: Add batch MMU notifier support
>>>   drm/amdkfd: Implement batch userptr page management
>>>   drm/amdkfd: Add batch allocation function and export API
>>>   drm/amdkfd: Unify userptr cleanup and update paths
>>>   drm/amdkfd: Wire up batch allocation in ioctl handler
>>>
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>>>  include/uapi/linux/kfd_ioctl.h                |  31 +-
>>>  4 files changed, 697 insertions(+), 24 deletions(-)
>>
>
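[As a closing illustration of what patches 3 and 4 of the cover letter
describe -- one notifier over the whole span plus an interval tree for
the ranges inside it -- registration could look roughly like this. The
function and type names are again assumptions carried over from the
earlier sketches, not the series' actual code.]

    #include <linux/interval_tree.h>
    #include <linux/minmax.h>
    #include <linux/mmu_notifier.h>

    /* Assumed to exist: batch_userptr/batch_range from the earlier
     * sketches and the batch_userptr_invalidate() callback. */
    static const struct mmu_interval_notifier_ops batch_userptr_notifier_ops = {
            .invalidate = batch_userptr_invalidate,
    };

    static int batch_userptr_register(struct batch_userptr *bu,
                                      struct mm_struct *mm,
                                      struct batch_range *ranges, int n)
    {
            unsigned long va_min = ranges[0].it_node.start;
            unsigned long va_max = ranges[0].it_node.last;
            int i;

            /* The span is the hull of all ranges, which can be far larger
             * than their sum when the CPU VAs are scattered. */
            for (i = 1; i < n; i++) {
                    va_min = min(va_min, ranges[i].it_node.start);
                    va_max = max(va_max, ranges[i].it_node.last);
            }

            for (i = 0; i < n; i++)
                    interval_tree_insert(&ranges[i].it_node, &bu->range_tree);

            /* One mmu_interval_notifier covering [va_min, va_max]; the
             * interval tree then filters invalidations inside it. */
            return mmu_interval_notifier_insert(&bu->notifier, mm, va_min,
                                                va_max - va_min + 1,
                                                &batch_userptr_notifier_ops);
    }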
