I've reworked the implementation in v4. The fix is actually inspired
by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).

DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
multiple user virtual address ranges under a single mmu_interval_notifier.
These ranges can be non-contiguous, which is essentially the same
problem that batch userptr needs to solve: one BO backed by multiple
non-contiguous CPU VA ranges sharing one notifier.

The wide notifier is created in drm_gpusvm_notifier_alloc:
  notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
  notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
The Xe driver passes
  xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
as the notifier_size, so one notifier can cover many MB of VA space
containing multiple non-contiguous ranges.
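
To make the alignment arithmetic concrete, here is a minimal userspace
sketch of the two lines above. The SKETCH_* macros and the sketch_bounds
names are stand-ins invented for illustration, not kernel or Xe code, and
the sketch assumes notifier_size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ALIGN_DOWN()/ALIGN(); 'a' must be a power of two. */
#define SKETCH_ALIGN_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))
#define SKETCH_ALIGN(x, a)      (((x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

/* Hypothetical container for the inclusive interval a wide notifier covers. */
struct sketch_bounds {
	uint64_t start; /* first address covered (inclusive) */
	uint64_t last;  /* last address covered (inclusive) */
};

/* Compute the notifier window that contains fault_addr. */
static struct sketch_bounds sketch_notifier_bounds(uint64_t fault_addr,
						   uint64_t notifier_size)
{
	struct sketch_bounds b;

	/* The mask-based alignment only works for power-of-two sizes. */
	assert(notifier_size && (notifier_size & (notifier_size - 1)) == 0);

	b.start = SKETCH_ALIGN_DOWN(fault_addr, notifier_size);
	b.last = SKETCH_ALIGN(fault_addr + 1, notifier_size) - 1;
	return b;
}
```

With a 512M notifier_size, a fault at 0x20001000 yields the window
[0x20000000, 0x3fffffff], so every range inside that window shares the
same notifier.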

DRM GPU SVM solves the per-range validity problem with flag-based
validation instead of seq-based validation:
  - drm_gpusvm_pages_valid() checks
      flags.has_dma_mapping
    not notifier_seq. The comment explicitly states:
      "This is akin to a notifier seqno check in the HMM documentation
       but due to wider notifiers (i.e., notifiers which span multiple
       ranges) this function is required for finer grained checking"
  - __drm_gpusvm_unmap_pages() clears
      flags.has_dma_mapping = false  under notifier_lock
  - drm_gpusvm_get_pages() sets
      flags.has_dma_mapping = true  under notifier_lock
I adopted the same approach.
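
The idea can be sketched in plain userspace C as follows. This is only an
illustration under simplifying assumptions, not the v4 code: all sketch_*
names are hypothetical, the lock is a toy that merely records ownership,
and a linear scan replaces the interval tree lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock that only records ownership, standing in for notifier_lock. */
struct sketch_lock { bool held; };

static void sketch_lock_acquire(struct sketch_lock *l) { assert(!l->held); l->held = true; }
static void sketch_lock_release(struct sketch_lock *l) { assert(l->held); l->held = false; }

/* One CPU VA range inside the wide notifier; 'valid' plays the role of has_dma_mapping. */
struct sketch_range {
	unsigned long start; /* inclusive */
	unsigned long last;  /* inclusive */
	bool valid;
};

struct sketch_batch {
	struct sketch_lock notifier_lock;
	struct sketch_range *ranges;
	size_t nr_ranges;
};

/* Invalidation callback: clear the flag only on ranges overlapping [start, last]. */
static void sketch_invalidate(struct sketch_batch *b, unsigned long start,
			      unsigned long last)
{
	sketch_lock_acquire(&b->notifier_lock);
	for (size_t i = 0; i < b->nr_ranges; i++) {
		struct sketch_range *r = &b->ranges[i];

		if (r->start <= last && start <= r->last)
			r->valid = false;
	}
	sketch_lock_release(&b->notifier_lock);
}

/* Per-range check; the caller must already hold notifier_lock. */
static bool sketch_range_valid(const struct sketch_batch *b, size_t idx)
{
	assert(b->notifier_lock.held);
	return b->ranges[idx].valid;
}

/* Final validation: the batch is usable only if every range still has its flag set. */
static bool sketch_batch_valid(struct sketch_batch *b)
{
	bool ok = true;

	sketch_lock_acquire(&b->notifier_lock);
	for (size_t i = 0; i < b->nr_ranges; i++)
		ok = ok && sketch_range_valid(b, i);
	sketch_lock_release(&b->notifier_lock);
	return ok;
}
```

An invalidation that falls in a gap between ranges leaves every flag
untouched, which is the finer-grained behaviour a single shared seq value
cannot express.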

DRM GPU SVM:
  drm_gpusvm_notifier_invalidate()
    down_write(&gpusvm->notifier_lock);
    mmu_interval_set_seq(mni, cur_seq);
    gpusvm->ops->invalidate()
      -> xe_svm_invalidate()
         drm_gpusvm_for_each_range()
           -> __drm_gpusvm_unmap_pages()
              WRITE_ONCE(flags.has_dma_mapping = false);  // clear flag
    up_write(&gpusvm->notifier_lock);

KFD batch userptr:
  amdgpu_amdkfd_evict_userptr_batch()
    mutex_lock(&process_info->notifier_lock);
    mmu_interval_set_seq(mni, cur_seq);
    discard_invalid_ranges()
      interval_tree_iter_first/next()
        range_info->valid = false;          // clear flag
    mutex_unlock(&process_info->notifier_lock);

Both implementations:
  - Acquire notifier_lock FIRST, before any flag changes
  - Call mmu_interval_set_seq() under the lock
  - Use interval tree to find affected ranges within the wide notifier
  - Mark per-range flag as invalid/valid under the lock

The page fault path and final validation path also follow the same
pattern as DRM GPU SVM: fault outside the lock, set/check per-range
flag under the lock.
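
A rough userspace sketch of that pattern is below. It is modeled loosely
on the mmu_interval_read_begin()/mmu_interval_read_retry() idiom from the
HMM documentation; the sketch_* types, the fault hook and the retry limit
are all assumptions made for illustration and do not come from the
patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one wide notifier: a sequence counter plus a lock-ownership bit. */
struct sketch_mni {
	unsigned long seq; /* bumped by every invalidation */
	bool lock_held;    /* stands in for notifier_lock */
};

struct sketch_range {
	bool valid; /* per-range flag, only written under the lock */
};

/* Hypothetical fault hook; a real driver would call hmm_range_fault() here. */
typedef void (*sketch_fault_fn)(struct sketch_mni *mni, void *ctx);

/* Example hook that simulates 'races_left' invalidations racing with the fault. */
struct sketch_fault_ctx {
	int calls;
	int races_left;
};

static void sketch_racy_fault(struct sketch_mni *mni, void *ctx)
{
	struct sketch_fault_ctx *c = ctx;

	c->calls++;
	if (c->races_left > 0) {
		c->races_left--;
		mni->seq++; /* an invalidation ran while we were faulting */
	}
}

/* Returns 0 once the range is marked valid, -1 if every attempt raced. */
static int sketch_get_pages(struct sketch_mni *mni, struct sketch_range *r,
			    sketch_fault_fn fault, void *ctx, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		unsigned long seq = mni->seq; /* like mmu_interval_read_begin() */

		assert(!mni->lock_held);
		fault(mni, ctx);              /* fault pages with no lock held */

		mni->lock_held = true;        /* take notifier_lock */
		if (mni->seq != seq) {        /* like mmu_interval_read_retry() */
			mni->lock_held = false;
			continue;             /* raced with an invalidation, retry */
		}
		r->valid = true;              /* set the flag under the lock */
		mni->lock_held = false;
		return 0;
	}
	return -1;
}
```

Calling sketch_get_pages() with sketch_racy_fault and races_left = 1 takes
two fault attempts before the flag is set, since the first attempt sees a
changed seq under the lock and retries.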

Regards,
Honglei


On 2026/2/6 21:56, Christian König wrote:
On 2/6/26 07:25, Honglei Huang wrote:
From: Honglei Huang <[email protected]>

Hi all,

This is v3 of the patch series to support allocating multiple non-contiguous
CPU virtual address ranges that map to a single contiguous GPU virtual address.

v3:
1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
    - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH

That is most likely not the best approach, but Felix or Philip need to comment 
here since I don't know such IOCTLs well either.

    - When flag is set, mmap_offset field points to range array
    - Minimal API surface change

Why range of VA space for each entry?

2. Improved MMU notifier handling:
    - Single mmu_interval_notifier covering the VA span [va_min, va_max]
    - Interval tree for efficient lookup of affected ranges during invalidation
    - Avoids per-range notifier overhead mentioned in v2 review

That won't work unless you also modify hmm_range_fault() to take multiple VA 
addresses (or ranges) at the same time.

The problem is that we must rely on hmm_range.notifier_seq to detect changes to 
the page tables in question, but that in turn works only if you have one 
hmm_range structure and not multiple.

What might work is doing an XOR or CRC over all hmm_range.notifier_seq you 
have, but that is a bit flaky.

Regards,
Christian.


3. Better code organization: Split into 8 focused patches for easier review

v2:
    - Each CPU VA range gets its own mmu_interval_notifier for invalidation
    - All ranges validated together and mapped to contiguous GPU VA
    - Single kgd_mem object with array of user_range_info structures
    - Unified eviction/restore path for all ranges in a batch

Current Implementation Approach
===============================

This series implements a practical solution within existing kernel constraints:

1. Single MMU notifier for VA span: Register one notifier covering the
    entire range from lowest to highest address in the batch

2. Interval tree filtering: Use interval tree to efficiently identify
    which specific ranges are affected during invalidation callbacks,
    avoiding unnecessary processing for unrelated address changes

3. Unified eviction/restore: All ranges in a batch share eviction and
    restore paths, maintaining consistency with existing userptr handling

Patch Series Overview
=====================

Patch 1/8: Add userptr batch allocation UAPI structures
     - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
     - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures

Patch 2/8: Add user_range_info infrastructure to kgd_mem
     - user_range_info structure for per-range tracking
     - Fields for batch allocation in kgd_mem

Patch 3/8: Implement interval tree for userptr ranges
     - Interval tree for efficient range lookup during invalidation
     - mark_invalid_ranges() function

Patch 4/8: Add batch MMU notifier support
     - Single notifier for entire VA span
     - Invalidation callback using interval tree filtering

Patch 5/8: Implement batch userptr page management
     - get_user_pages_batch() and set_user_pages_batch()
     - Per-range page array management

Patch 6/8: Add batch allocation function and export API
     - init_user_pages_batch() main initialization
     - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point

Patch 7/8: Unify userptr cleanup and update paths
     - Shared eviction/restore handling for batch allocations
     - Integration with existing userptr validation flows

Patch 8/8: Wire up batch allocation in ioctl handler
     - Input validation and range array parsing
     - Integration with existing alloc_memory_of_gpu path

Testing
=======

- Multiple scattered malloc() allocations (2-4000+ ranges)
- Various allocation sizes (4KB to 1GB+ per range)
- Memory pressure scenarios and eviction/restore cycles
- OpenCL CTS and HIP catch tests in KVM guest environment
- AI workloads: Stable Diffusion, ComfyUI in virtualized environments
- Small LLM inference (3B-7B models)
- Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
- Performance improvement: 2x-2.4x faster than userspace approach

Thank you for your review and feedback.

Best regards,
Honglei Huang

Honglei Huang (8):
   drm/amdkfd: Add userptr batch allocation UAPI structures
   drm/amdkfd: Add user_range_info infrastructure to kgd_mem
   drm/amdkfd: Implement interval tree for userptr ranges
   drm/amdkfd: Add batch MMU notifier support
   drm/amdkfd: Implement batch userptr page management
   drm/amdkfd: Add batch allocation function and export API
   drm/amdkfd: Unify userptr cleanup and update paths
   drm/amdkfd: Wire up batch allocation in ioctl handler

  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
  include/uapi/linux/kfd_ioctl.h                |  31 +-
  4 files changed, 697 insertions(+), 24 deletions(-)


