You said that DRM GPU SVM has the same pattern, but argued
that it is not designed for "batch userptr". However, this distinction
has no technical significance. The core problem is "multiple ranges
under one wide notifier doing per-range hmm_range_fault()". Whether
these ranges are created dynamically by GPU page faults or
specified in a batch via ioctl, the concurrency safety mechanism is
the same.

You said "each hmm_range_fault() can invalidate the other ranges
while faulting them in". Yes, this can happen but this is precisely
the scenario that mem->invalid catches:

  1. hmm_range_fault(A) succeeds
  2. hmm_range_fault(B) triggers reclaim → A's pages swapped out
     → MMU notifier callback:
       mutex_lock(notifier_lock)
         range_A->valid = false
         mem->invalid++
       mutex_unlock(notifier_lock)
  3. hmm_range_fault(B) completes
  4. Commit phase:
       mutex_lock(notifier_lock)
         mem->invalid != saved_invalid
         → return -EAGAIN, retry entire batch
       mutex_unlock(notifier_lock)

Invalid pages are never committed.
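
To make the ordering concrete, here is a minimal sketch of that
counter-based retry pattern. The names (batch_mem, batch_invalidate,
batch_commit) are placeholders for illustration, not the actual
kgd_mem/process_info code:

    #include <linux/errno.h>
    #include <linux/mutex.h>

    /* Hypothetical state shared by the notifier and the commit path. */
    struct batch_mem {
        struct mutex *notifier_lock;   /* process-wide notifier lock */
        unsigned int invalid;          /* bumped by every invalidation */
    };

    /* MMU notifier side: any fault currently in flight must retry. */
    static void batch_invalidate(struct batch_mem *mem)
    {
        mutex_lock(mem->notifier_lock);
        mem->invalid++;
        mutex_unlock(mem->notifier_lock);
    }

    /* Commit side: only commit if no invalidation raced with the faults. */
    static int batch_commit(struct batch_mem *mem, unsigned int saved_invalid)
    {
        int ret = 0;

        mutex_lock(mem->notifier_lock);
        if (mem->invalid != saved_invalid)
            ret = -EAGAIN;             /* caller re-faults the whole batch */
        mutex_unlock(mem->notifier_lock);
        return ret;
    }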

Regards,
Honglei


On 2026/2/9 22:25, Christian König wrote:
On 2/9/26 15:16, Honglei Huang wrote:
The case you described, one hmm_range_fault() invalidating another's
seq under the same notifier, is already handled in the implementation.

  For example, suppose ranges A, B and C share one notifier:

   1. hmm_range_fault(A) succeeds, seq_A recorded
   2. External invalidation occurs, triggers callback:
      mutex_lock(notifier_lock)
        → mmu_interval_set_seq()
        → range_A->valid = false
        → mem->invalid++
      mutex_unlock(notifier_lock)
   3. hmm_range_fault(B) succeeds
   4. Commit phase:
      mutex_lock(notifier_lock)
        → check mem->invalid != saved_invalid
        → return -EAGAIN, retry the entire batch
      mutex_unlock(notifier_lock)

All concurrent invalidations are caught by the mem->invalid counter.
Additionally, amdgpu_ttm_tt_get_user_pages_done(), called from
confirm_valid_user_pages_locked(), performs a per-range
mmu_interval_read_retry() as a final safety check.
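
As a minimal illustration of that final per-range check (the range
layout and helper name are assumptions; only mmu_interval_read_retry()
itself is the real kernel API):

    #include <linux/mmu_notifier.h>

    /* Hypothetical per-range bookkeeping, not the real amdgpu structures. */
    struct userptr_range {
        struct mmu_interval_notifier *notifier;  /* shared wide notifier */
        unsigned long notifier_seq;              /* recorded at fault time */
        bool valid;                              /* cleared on invalidation */
    };

    /* Called with the notifier lock held, after the mem->invalid check. */
    static bool ranges_still_valid(struct userptr_range *ranges, int n)
    {
        int i;

        for (i = 0; i < n; i++) {
            if (!ranges[i].valid)
                return false;
            /* Same idea as amdgpu_ttm_tt_get_user_pages_done(): the
             * sequence seen at fault time must still be current. */
            if (mmu_interval_read_retry(ranges[i].notifier,
                                        ranges[i].notifier_seq))
                return false;
        }
        return true;
    }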

DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
hmm_range_fault() per-range independently; there is no array version
of hmm_range_fault() in DRM GPU SVM either. If you consider this approach
unworkable, then DRM GPU SVM would be unworkable too, yet it has been
accepted upstream.

The number of batch ranges is controllable. Even if it
scales to thousands, DRM GPU SVM faces exactly the same situation:
it does not need an array version of hmm_range_fault() either, which
shows this is a correctness question, not a performance one. For
correctness, I believe DRM GPU SVM already demonstrates that the
approach is sound.

Well yes, GPU SVM would have exactly the same problems. But it also doesn't 
have a bulk userptr creation interface.

The implementation is simply not made for this use case, and as far as I know 
no current upstream implementation is.

For performance, I have tested with thousands of ranges present:
performance reaches 80%~95% of the native driver, and all OpenCL
and ROCr test suites pass with no correctness issues.

Testing can only falsify a system and not verify it.

Here is how DRM GPU SVM handles correctness with multiple ranges
under one wide notifier doing per-range hmm_range_fault:

   Invalidation: drm_gpusvm_notifier_invalidate()
     - Acquires notifier_lock
     - Calls mmu_interval_set_seq()
     - Iterates affected ranges via driver callback (xe_svm_invalidate)
     - Clears has_dma_mapping = false for each affected range (under lock)
     - Releases notifier_lock

   Fault: drm_gpusvm_get_pages()  (called per-range independently)
     - mmu_interval_read_begin() to get seq
     - hmm_range_fault() outside lock
     - Acquires notifier_lock
     - mmu_interval_read_retry() → if stale, release lock and retry
     - DMA map pages + set has_dma_mapping = true (under lock)
     - Releases notifier_lock

   Validation: drm_gpusvm_pages_valid()
     - Checks has_dma_mapping flag (under lock), NOT seq

If invalidation occurs between two per-range faults, the flag is
cleared under lock, and either mmu_interval_read_retry catches it
in the current fault, or drm_gpusvm_pages_valid() catches it at
validation time. No stale pages are ever committed.
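
For reference, a condensed sketch of that per-range fault pattern, with
generic names (svm_range_pages and fault_one_range are placeholders; the
real drm_gpusvm code differs in detail and uses an rwsem for the
notifier lock):

    #include <linux/hmm.h>
    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>
    #include <linux/mutex.h>

    /* Hypothetical per-range page state, stand-in for a drm_gpusvm_range. */
    struct svm_range_pages {
        bool has_dma_mapping;
    };

    static int fault_one_range(struct mmu_interval_notifier *notifier,
                               struct mutex *notifier_lock,
                               struct mm_struct *mm,
                               struct hmm_range *range,
                               struct svm_range_pages *pages)
    {
        int err;

        range->notifier = notifier;
    retry:
        range->notifier_seq = mmu_interval_read_begin(notifier);

        mmap_read_lock(mm);
        err = hmm_range_fault(range);          /* outside the notifier lock */
        mmap_read_unlock(mm);
        if (err == -EBUSY)
            goto retry;                        /* timeout handling omitted */
        if (err)
            return err;

        mutex_lock(notifier_lock);
        if (mmu_interval_read_retry(notifier, range->notifier_seq)) {
            mutex_unlock(notifier_lock);
            goto retry;                        /* raced with an invalidation */
        }
        /* DMA mapping would happen here, then publish the flag under lock. */
        WRITE_ONCE(pages->has_dma_mapping, true);
        mutex_unlock(notifier_lock);
        return 0;
    }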

KFD batch userptr uses the same three-step pattern:

   Invalidation: amdgpu_amdkfd_evict_userptr_batch()
     - Acquires notifier_lock
     - Calls mmu_interval_set_seq()
     - Iterates affected ranges via interval_tree
     - Sets range->valid = false for each affected range (under lock)
     - Increments mem->invalid (under lock)
     - Releases notifier_lock

   Fault: update_invalid_user_pages()
     - Per-range hmm_range_fault() outside lock

And here the idea falls apart. Each hmm_range_fault() can invalidate the other 
ranges while faulting them in.

That is not fundamentally solvable, but moving the handling further into 
hmm_range_fault() makes it much less likely that something goes wrong.

So once more: as long as this still uses this hacky approach, I will clearly 
reject this implementation.

Regards,
Christian.

     - Acquires notifier_lock
     - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
     - Sets range->valid = true for faulted ranges (under lock)
     - Releases notifier_lock

   Validation: valid_user_pages_batch()
     - Checks range->valid flag
     - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)

The logic is equivalent as far as I can see.

Regards,
Honglei



On 2026/2/9 21:27, Christian König wrote:
On 2/9/26 14:11, Honglei Huang wrote:

So is DRM GPU SVM also a NAK?

This code has passed local testing with OpenCL and ROCr, and I also provided a 
detailed code path and analysis.
You only stated the conclusion without providing any reasons or evidence. Your 
statement gives no concrete justification and is difficult to find convincing 
so far.

That sounds like you don't understand what the issue here is, so I will try to 
explain this once more with pseudo-code.

Page tables are updated without holding a lock, so when you want to grab 
physical addresses from them you need to use an opportunistic, retry-based 
approach to make sure that the data you got is still valid.

In other words something like this here is needed:

retry:
     hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
     hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
...
     while (true) {
         mmap_read_lock(mm);
         err = hmm_range_fault(&hmm_range);
         mmap_read_unlock(mm);

         if (err == -EBUSY) {
             if (time_after(jiffies, timeout))
                 break;

             hmm_range.notifier_seq =
                 mmu_interval_read_begin(notifier);
             continue;
         }
         break;
     }
...
     for (i = 0, j = 0; i < npages; ++j) {
...
         dma_map_page(...)
...
     grab_notifier_lock();
     if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
         goto retry;
     restart_queues();
     drop_notifier_lock();
...

Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or 
not after you grabbed the notifier lock.

The problem is that hmm_range works only on a single range/sequence 
combination, so when you do multiple calls to hmm_range_fault() for scattered 
VAs it can easily be that one call invalidates the ranges of another call.

So as long as you only have a few hundred hmm_ranges for your userptrs that 
kind of works, but it doesn't scale up into the thousands of different VA 
addresses you get for scattered handling.

That's why hmm_range_fault needs to be modified to handle an array of VA 
addresses instead of just an A..B range.

Regards,
Christian.



On 2026/2/9 20:59, Christian König wrote:
On 2/9/26 13:52, Honglei Huang wrote:
DRM GPU SVM does use hmm_range_fault(); see drm_gpusvm_get_pages().

I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a 
single range as well and not scatter/gather of VA addresses.

As far as I can see that doesn't help the slightest.

My implementation follows the same pattern. The detailed comparison
of the invalidation path was provided in the second half of my previous mail.

Yeah and as I said that is not very valuable because it doesn't solve the 
sequence problem.

As far as I can see the approach you try here is a clear NAK from my side.

Regards,
Christian.


On 2026/2/9 18:16, Christian König wrote:
On 2/9/26 07:14, Honglei Huang wrote:

I've reworked the implementation in v4. The fix is actually inspired
by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).

DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
multiple user virtual address ranges under a single mmu_interval_notifier,
and these ranges can be non-contiguous, which is essentially the same
problem that batch userptr needs to solve: one BO backed by multiple
non-contiguous CPU VA ranges sharing one notifier.

That still doesn't solve the sequencing problem.

As far as I can see you can't use hmm_range_fault with this approach or it 
would just not be very valuable.

So how should that work with your patch set?

Regards,
Christian.


The wide notifier is created in drm_gpusvm_notifier_alloc:
      notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
      notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
The Xe driver passes
      xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
as the notifier_size, so one notifier can cover many MB of VA space
containing multiple non-contiguous ranges.
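
For example, with a 512 MiB notifier_size a fault at (say) 0x7f2345678000
would land in the notifier covering [0x7f2340000000, 0x7f235fffffff], so
any other userptr ranges falling inside that window share the same notifier.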

And DRM GPU SVM solves the per-range validity problem with flag-based
validation instead of seq-based validation in:
      - drm_gpusvm_pages_valid() checks
          flags.has_dma_mapping
        not notifier_seq. The comment explicitly states:
          "This is akin to a notifier seqno check in the HMM documentation
           but due to wider notifiers (i.e., notifiers which span multiple
           ranges) this function is required for finer grained checking"
      - __drm_gpusvm_unmap_pages() clears
          flags.has_dma_mapping = false  under notifier_lock
      - drm_gpusvm_get_pages() sets
          flags.has_dma_mapping = true  under notifier_lock
I adopted the same approach.

DRM GPU SVM:
      drm_gpusvm_notifier_invalidate()
        down_write(&gpusvm->notifier_lock);
        mmu_interval_set_seq(mni, cur_seq);
        gpusvm->ops->invalidate()
          -> xe_svm_invalidate()
             drm_gpusvm_for_each_range()
               -> __drm_gpusvm_unmap_pages()
                  WRITE_ONCE(flags.has_dma_mapping = false);  // clear flag
        up_write(&gpusvm->notifier_lock);

KFD batch userptr:
      amdgpu_amdkfd_evict_userptr_batch()
        mutex_lock(&process_info->notifier_lock);
        mmu_interval_set_seq(mni, cur_seq);
        discard_invalid_ranges()
          interval_tree_iter_first/next()
            range_info->valid = false;          // clear flag
        mutex_unlock(&process_info->notifier_lock);

Both implementations:
      - Acquire notifier_lock FIRST, before any flag changes
      - Call mmu_interval_set_seq() under the lock
      - Use interval tree to find affected ranges within the wide notifier
      - Mark per-range flag as invalid/valid under the lock

The page fault path and final validation path also follow the same
pattern as DRM GPU SVM: fault outside the lock, set/check per-range
flag under the lock.
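
A condensed sketch of that invalidation side, for illustration only
(batch_range, invalidate_ranges and the field layout are placeholders,
not the actual kgd_mem code):

    #include <linux/interval_tree.h>
    #include <linux/mmu_notifier.h>
    #include <linux/mutex.h>

    /* Hypothetical per-range node hanging off the interval tree. */
    struct batch_range {
        struct interval_tree_node it;  /* [start, last] of one CPU VA range */
        bool valid;
    };

    /* Body of the invalidation callback: the wide notifier spans
     * [va_min, va_max], the interval tree picks out only the ranges
     * the event actually touches. */
    static void invalidate_ranges(struct mutex *notifier_lock,
                                  struct mmu_interval_notifier *mni,
                                  unsigned long cur_seq,
                                  struct rb_root_cached *itree,
                                  unsigned long start, unsigned long last,
                                  unsigned int *invalid_counter)
    {
        struct interval_tree_node *node;

        mutex_lock(notifier_lock);
        mmu_interval_set_seq(mni, cur_seq);

        for (node = interval_tree_iter_first(itree, start, last); node;
             node = interval_tree_iter_next(node, start, last)) {
            struct batch_range *range =
                container_of(node, struct batch_range, it);

            range->valid = false;      /* per-range flag, cleared under lock */
        }
        (*invalid_counter)++;          /* forces -EAGAIN at commit time */
        mutex_unlock(notifier_lock);
    }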

Regards,
Honglei


On 2026/2/6 21:56, Christian König wrote:
On 2/6/26 07:25, Honglei Huang wrote:
From: Honglei Huang <[email protected]>

Hi all,

This is v3 of the patch series to support allocating multiple non-contiguous
CPU virtual address ranges that map to a single contiguous GPU virtual address.

v3:
1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
        - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH

That is most likely not the best approach, but Felix or Philip need to comment 
here since I don't know such IOCTLs well either.

        - When flag is set, mmap_offset field points to range array
        - Minimal API surface change

Why a range of VA space for each entry?

2. Improved MMU notifier handling:
        - Single mmu_interval_notifier covering the VA span [va_min, va_max]
        - Interval tree for efficient lookup of affected ranges during 
invalidation
        - Avoids per-range notifier overhead mentioned in v2 review

That won't work unless you also modify hmm_range_fault() to take multiple VA 
addresses (or ranges) at the same time.

The problem is that we must rely on hmm_range.notifier_seq to detect changes to 
the page tables in question, but that in turn works only if you have one 
hmm_range structure and not multiple.

What might work is doing an XOR or CRC over all hmm_range.notifier_seq you 
have, but that is a bit flaky.
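
Purely as an illustration of that XOR idea (this is not existing code,
and as noted it would still be a bit flaky):

    #include <linux/hmm.h>

    /* Fold all per-range sequence numbers into one value at fault time and
     * compare against the same fold at commit time. Illustration only. */
    static unsigned long fold_notifier_seqs(const struct hmm_range *ranges,
                                            unsigned int n)
    {
        unsigned long folded = 0;
        unsigned int i;

        for (i = 0; i < n; i++)
            folded ^= ranges[i].notifier_seq;
        return folded;
    }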

Regards,
Christian.


3. Better code organization: Split into 8 focused patches for easier review

v2:
        - Each CPU VA range gets its own mmu_interval_notifier for invalidation
        - All ranges validated together and mapped to contiguous GPU VA
        - Single kgd_mem object with array of user_range_info structures
        - Unified eviction/restore path for all ranges in a batch

Current Implementation Approach
===============================

This series implements a practical solution within existing kernel constraints:

1. Single MMU notifier for VA span: Register one notifier covering the
        entire range from lowest to highest address in the batch

2. Interval tree filtering: Use interval tree to efficiently identify
        which specific ranges are affected during invalidation callbacks,
        avoiding unnecessary processing for unrelated address changes

3. Unified eviction/restore: All ranges in a batch share eviction and
        restore paths, maintaining consistency with existing userptr handling

Patch Series Overview
=====================

Patch 1/8: Add userptr batch allocation UAPI structures
         - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
         - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures

Patch 2/8: Add user_range_info infrastructure to kgd_mem
         - user_range_info structure for per-range tracking
         - Fields for batch allocation in kgd_mem

Patch 3/8: Implement interval tree for userptr ranges
         - Interval tree for efficient range lookup during invalidation
         - mark_invalid_ranges() function

Patch 4/8: Add batch MMU notifier support
         - Single notifier for entire VA span
         - Invalidation callback using interval tree filtering

Patch 5/8: Implement batch userptr page management
         - get_user_pages_batch() and set_user_pages_batch()
         - Per-range page array management

Patch 6/8: Add batch allocation function and export API
         - init_user_pages_batch() main initialization
         - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point

Patch 7/8: Unify userptr cleanup and update paths
         - Shared eviction/restore handling for batch allocations
         - Integration with existing userptr validation flows

Patch 8/8: Wire up batch allocation in ioctl handler
         - Input validation and range array parsing
         - Integration with existing alloc_memory_of_gpu path

Testing
=======

- Multiple scattered malloc() allocations (2-4000+ ranges)
- Various allocation sizes (4KB to 1G+ per range)
- Memory pressure scenarios and eviction/restore cycles
- OpenCL CTS and HIP catch tests in KVM guest environment
- AI workloads: Stable Diffusion, ComfyUI in virtualized environments
- Small LLM inference (3B-7B models)
- Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
- Performance improvement: 2x-2.4x faster than userspace approach

Thank you for your review and feedback.

Best regards,
Honglei Huang

Honglei Huang (8):
       drm/amdkfd: Add userptr batch allocation UAPI structures
       drm/amdkfd: Add user_range_info infrastructure to kgd_mem
       drm/amdkfd: Implement interval tree for userptr ranges
       drm/amdkfd: Add batch MMU notifier support
       drm/amdkfd: Implement batch userptr page management
       drm/amdkfd: Add batch allocation function and export API
       drm/amdkfd: Unify userptr cleanup and update paths
       drm/amdkfd: Wire up batch allocation in ioctl handler

      drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
      .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
      drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
      include/uapi/linux/kfd_ioctl.h                |  31 +-
      4 files changed, 697 insertions(+), 24 deletions(-)
