On 2/9/26 15:16, Honglei Huang wrote:
> The case you described, where one hmm_range_fault() invalidates another's
> seq under the same notifier, is already handled in the implementation.
> 
> For example, suppose ranges A, B, and C share one notifier:
> 
>   1. hmm_range_fault(A) succeeds, seq_A recorded
>   2. External invalidation occurs, triggers callback:
>      mutex_lock(notifier_lock)
>        → mmu_interval_set_seq()
>        → range_A->valid = false
>        → mem->invalid++
>      mutex_unlock(notifier_lock)
>   3. hmm_range_fault(B) succeeds
>   4. Commit phase:
>      mutex_lock(notifier_lock)
>        → check mem->invalid != saved_invalid
>        → return -EAGAIN, retry the entire batch
>      mutex_unlock(notifier_lock)
> 
> All concurrent invalidations are caught by the mem->invalid counter.
> Additionally, amdgpu_ttm_tt_get_user_pages_done() in
> confirm_valid_user_pages_locked() performs a per-range
> mmu_interval_read_retry() as a final safety check.
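>
> As a minimal sketch of the flow above (the helper and variable names are
> illustrative, not the exact patch code):
>
>   /* Fault phase: record the counter, then fault each range outside the
>    * notifier lock. */
>   mutex_lock(&process_info->notifier_lock);
>   saved_invalid = mem->invalid;
>   mutex_unlock(&process_info->notifier_lock);
>
>   for (i = 0; i < nranges; i++) {
>       mmap_read_lock(mm);
>       ret = hmm_range_fault(&ranges[i].hmm_range);
>       mmap_read_unlock(mm);
>       if (ret)
>           return ret;                 /* e.g. -EBUSY: caller retries */
>   }
>
>   /* Commit phase: any invalidation since then has bumped mem->invalid. */
>   mutex_lock(&process_info->notifier_lock);
>   if (mem->invalid != saved_invalid) {
>       mutex_unlock(&process_info->notifier_lock);
>       return -EAGAIN;                 /* caller retries the entire batch */
>   }
>   /* ... commit page arrays / mappings under the lock ... */
>   mutex_unlock(&process_info->notifier_lock);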
> 
> DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
> hmm_range_fault() per range independently, and there is no array version
> of hmm_range_fault() in DRM GPU SVM either. If you consider this approach
> unworkable, then DRM GPU SVM would be unworkable too, yet it has been
> accepted upstream.
> 
> The number of batch ranges is controllable, and even if it scales into
> the thousands, DRM GPU SVM faces exactly the same situation: it does not
> need an array version of hmm_range_fault() either. That makes this a
> correctness question rather than a performance one, and for correctness
> I believe DRM GPU SVM already demonstrates that the approach works.

Well yes, GPU SVM would have exactly the same problems. But it also doesn't 
have an interface for creating userptrs in bulk.

The implementation is simply not made for this use case, and as far as I know 
no current upstream implementation is.

> For performance, I have tested with thousands of ranges present:
> performance reaches 80%~95% of the native driver, and all OpenCL
> and ROCr test suites pass with no correctness issues.

Testing can only falsify a system and not verify it.

> Here is how DRM GPU SVM handles correctness with multiple ranges under
> one wide notifier while doing per-range hmm_range_fault():
> 
>   Invalidation: drm_gpusvm_notifier_invalidate()
>     - Acquires notifier_lock
>     - Calls mmu_interval_set_seq()
>     - Iterates affected ranges via driver callback (xe_svm_invalidate)
>     - Clears has_dma_mapping = false for each affected range (under lock)
>     - Releases notifier_lock
> 
>   Fault: drm_gpusvm_get_pages()  (called per-range independently)
>     - mmu_interval_read_begin() to get seq
>     - hmm_range_fault() outside lock
>     - Acquires notifier_lock
>     - mmu_interval_read_retry() → if stale, release lock and retry
>     - DMA map pages + set has_dma_mapping = true (under lock)
>     - Releases notifier_lock
> 
>   Validation: drm_gpusvm_pages_valid()
>     - Checks has_dma_mapping flag (under lock), NOT seq
> 
> If invalidation occurs between two per-range faults, the flag is
> cleared under lock, and either mmu_interval_read_retry catches it
> in the current fault, or drm_gpusvm_pages_valid() catches it at
> validation time. No stale pages are ever committed.
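>
> Condensed into pseudo-code, the flag-based pattern is roughly this
> (paraphrased from drm_gpusvm.c, not a verbatim excerpt):
>
>   /* Invalidation, with notifier_lock held for write: */
>   mmu_interval_set_seq(mni, cur_seq);
>   for each range overlapping the invalidated span
>       WRITE_ONCE(range->flags.has_dma_mapping, false);
>
>   /* Fault path, per range: */
>   seq = mmu_interval_read_begin(mni);
>   hmm_range_fault(&hmm_range);                  /* outside notifier_lock */
>   down_read(&gpusvm->notifier_lock);
>   if (mmu_interval_read_retry(mni, seq)) {
>       up_read(&gpusvm->notifier_lock);
>       goto retry;
>   }
>   /* ... dma-map pages ... */
>   WRITE_ONCE(range->flags.has_dma_mapping, true);
>   up_read(&gpusvm->notifier_lock);
>
>   /* Validation checks the flag, not the seq: */
>   valid = READ_ONCE(range->flags.has_dma_mapping);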
> 
> KFD batch userptr uses the same three-step pattern:
> 
>   Invalidation: amdgpu_amdkfd_evict_userptr_batch()
>     - Acquires notifier_lock
>     - Calls mmu_interval_set_seq()
>     - Iterates affected ranges via interval_tree
>     - Sets range->valid = false for each affected range (under lock)
>     - Increments mem->invalid (under lock)
>     - Releases notifier_lock
> 
>   Fault: update_invalid_user_pages()
>     - Per-range hmm_range_fault() outside lock

And here the idea falls apart. Each hmm_range_fault() can invalidate the other 
ranges while faulting them in.

That is not fundamentally solvable, but moving the handling further into 
hmm_range_fault() makes it much less likely that something goes wrong.

So once more: as long as this still uses this hacky approach, I will clearly 
reject this implementation.

Regards,
Christian.

>     - Acquires notifier_lock
>     - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
>     - Sets range->valid = true for faulted ranges (under lock)
>     - Releases notifier_lock
> 
>   Validation: valid_user_pages_batch()
>     - Checks range->valid flag
>     - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
> 
> The logic is equivalent as far as I can see.
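>
> For reference, the invalidation side above condenses to roughly the
> following (simplified from the patch, so some names may differ slightly):
>
>   static bool evict_userptr_batch(struct mmu_interval_notifier *mni,
>                                   const struct mmu_notifier_range *range,
>                                   unsigned long cur_seq)
>   {
>       struct kgd_mem *mem = container_of(mni, struct kgd_mem, batch_notifier);
>       unsigned long start = range->start, last = range->end - 1;
>       struct interval_tree_node *node;
>
>       mutex_lock(&mem->process_info->notifier_lock);
>       mmu_interval_set_seq(mni, cur_seq);
>
>       /* Only batch ranges overlapping the invalidated span are marked. */
>       for (node = interval_tree_iter_first(&mem->range_tree, start, last);
>            node; node = interval_tree_iter_next(node, start, last)) {
>           struct user_range_info *ri =
>               container_of(node, struct user_range_info, it_node);
>
>           ri->valid = false;
>       }
>       mem->invalid++;
>       mutex_unlock(&mem->process_info->notifier_lock);
>
>       return true;
>   }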
> 
> Regards,
> Honglei
> 
> 
> 
> On 2026/2/9 21:27, Christian König wrote:
>> On 2/9/26 14:11, Honglei Huang wrote:
>>>
>>> So DRM GPU SVM is also a NAK?
>>>
>>> This code has passed local testing with OpenCL and ROCr, and I also
>>> provided a detailed code path and analysis.
>>> You only stated a conclusion without providing any reasons or evidence.
>>> So far your statement gives no concrete justification and is not
>>> convincing.
>>
>> That sounds like you don't understand what the issue here is, so I will try
>> to explain it once more with pseudo-code.
>>
>> Page tables are updated without holding a lock, so when you want to grab
>> physical addresses from them you need to use an opportunistic, retry-based
>> approach to make sure that the data you got is still valid.
>>
>> In other words something like this here is needed:
>>
>> retry:
>>     hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>     hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
>> ...
>>     while (true) {
>>         mmap_read_lock(mm);
>>         err = hmm_range_fault(&hmm_range);
>>         mmap_read_unlock(mm);
>>
>>         if (err == -EBUSY) {
>>             if (time_after(jiffies, timeout))
>>                 break;
>>
>>             hmm_range.notifier_seq =
>>                 mmu_interval_read_begin(notifier);
>>             continue;
>>         }
>>         break;
>>     }
>> ...
>>     for (i = 0, j = 0; i < npages; ++j) {
>> ...
>>         dma_map_page(...)
>> ...
>>     grab_notifier_lock();
>>     if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
>>         goto retry;
>>     restart_queues();
>>     drop_notifier_lock();
>> ...
>>
>> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid 
>> or not after you grabbed the notifier lock.
>>
>> The problem is that hmm_range works only on a single range/sequence
>> combination, so when you do multiple calls to hmm_range_fault() for
>> scattered VA it can easily happen that one call invalidates the ranges of
>> another call.
>>
>> So as long as you only have a few hundred hmm_ranges for your userptrs,
>> that kind of works, but it doesn't scale up to the thousands of different
>> VA addresses you get for scattered handling.
>>
>> That's why hmm_range_fault() needs to be modified to handle an array of VA
>> addresses instead of just an A..B range.
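>>
>> Purely as a sketch of the direction I mean (nothing like this exists in the
>> kernel today, all names here are made up):
>>
>>     struct hmm_va_range {
>>         unsigned long start;
>>         unsigned long end;
>>     };
>>
>>     struct hmm_range_array {
>>         struct mmu_interval_notifier *notifier;
>>         unsigned long notifier_seq;     /* one seq covering the whole batch */
>>         const struct hmm_va_range *ranges;
>>         unsigned int nranges;
>>         unsigned long *hmm_pfns;        /* PFNs of all ranges, concatenated */
>>         unsigned long default_flags;
>>     };
>>
>>     int hmm_range_array_fault(struct hmm_range_array *array);
>>
>> The point is that all the scattered VAs would be faulted against a single
>> notifier_seq, so one mmu_interval_read_retry() at commit time covers the
>> whole batch instead of N independent range/seq pairs.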
>>
>> Regards,
>> Christian.
>>
>>
>>>
>>> On 2026/2/9 20:59, Christian König wrote:
>>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>>
>>>> I'm not sure what you are talking about; drm_gpusvm_get_pages() only
>>>> supports a single range as well, not scatter/gather of VA addresses.
>>>>
>>>> As far as I can see that doesn't help in the slightest.
>>>>
>>>>> My implementation follows the same pattern. The detailed comparison of
>>>>> the invalidation path was provided in the second half of my previous mail.
>>>>
>>>> Yeah, and as I said, that is not very valuable because it doesn't solve
>>>> the sequence problem.
>>>>
>>>> As far as I can see the approach you are trying here is a clear NAK from my side.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>>
>>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>>
>>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>>> multiple user virtual address ranges under a single
>>>>>>> mmu_interval_notifier, and these ranges can be non-contiguous, which is
>>>>>>> essentially the same problem that batch userptr needs to solve: one BO
>>>>>>> backed by multiple non-contiguous CPU VA ranges sharing one notifier.
>>>>>>
>>>>>> That still doesn't solve the sequencing problem.
>>>>>>
>>>>>> As far as I can see you can't use hmm_range_fault() with this approach,
>>>>>> or at least it would not be very valuable.
>>>>>>
>>>>>> So how should that work with your patch set?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc():
>>>>>>>      notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>>>>      notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>>>> The Xe driver passes
>>>>>>>      xe_modparam.svm_notifier_size * SZ_1M
>>>>>>> as the notifier_size in xe_svm_init(), so one notifier can cover many MB
>>>>>>> of VA space containing multiple non-contiguous ranges.
>>>>>>>
>>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>>> validation instead of seq-based validation in:
>>>>>>>      - drm_gpusvm_pages_valid() checks
>>>>>>>          flags.has_dma_mapping
>>>>>>>        not notifier_seq. The comment explicitly states:
>>>>>>>          "This is akin to a notifier seqno check in the HMM 
>>>>>>> documentation
>>>>>>>           but due to wider notifiers (i.e., notifiers which span 
>>>>>>> multiple
>>>>>>>           ranges) this function is required for finer grained checking"
>>>>>>>      - __drm_gpusvm_unmap_pages() clears
>>>>>>>          flags.has_dma_mapping = false  under notifier_lock
>>>>>>>      - drm_gpusvm_get_pages() sets
>>>>>>>          flags.has_dma_mapping = true  under notifier_lock
>>>>>>> I adopted the same approach.
>>>>>>>
>>>>>>> DRM GPU SVM:
>>>>>>>      drm_gpusvm_notifier_invalidate()
>>>>>>>        down_write(&gpusvm->notifier_lock);
>>>>>>>        mmu_interval_set_seq(mni, cur_seq);
>>>>>>>        gpusvm->ops->invalidate()
>>>>>>>          -> xe_svm_invalidate()
>>>>>>>             drm_gpusvm_for_each_range()
>>>>>>>               -> __drm_gpusvm_unmap_pages()
>>>>>>>                  WRITE_ONCE(flags.has_dma_mapping, false);  // clear flag
>>>>>>>        up_write(&gpusvm->notifier_lock);
>>>>>>>
>>>>>>> KFD batch userptr:
>>>>>>>      amdgpu_amdkfd_evict_userptr_batch()
>>>>>>>        mutex_lock(&process_info->notifier_lock);
>>>>>>>        mmu_interval_set_seq(mni, cur_seq);
>>>>>>>        discard_invalid_ranges()
>>>>>>>          interval_tree_iter_first/next()
>>>>>>>            range_info->valid = false;          // clear flag
>>>>>>>        mutex_unlock(&process_info->notifier_lock);
>>>>>>>
>>>>>>> Both implementations:
>>>>>>>      - Acquire notifier_lock FIRST, before any flag changes
>>>>>>>      - Call mmu_interval_set_seq() under the lock
>>>>>>>      - Use interval tree to find affected ranges within the wide 
>>>>>>> notifier
>>>>>>>      - Mark per-range flag as invalid/valid under the lock
>>>>>>>
>>>>>>> The page fault path and final validation path also follow the same
>>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>>> flag under the lock.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Honglei
>>>>>>>
>>>>>>>
>>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>>> From: Honglei Huang <[email protected]>
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> This is v3 of the patch series to support allocating multiple 
>>>>>>>>> non-contiguous
>>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU 
>>>>>>>>> virtual address.
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>>>        - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>>
>>>>>>>> That is most likely not the best approach, but Felix or Philip need to 
>>>>>>>> comment here since I don't know such IOCTLs well either.
>>>>>>>>
>>>>>>>>>        - When flag is set, mmap_offset field points to range array
>>>>>>>>>        - Minimal API surface change
>>>>>>>>
>>>>>>>> Why a range of VA space for each entry?
>>>>>>>>
>>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>>>        - Single mmu_interval_notifier covering the VA span [va_min, 
>>>>>>>>> va_max]
>>>>>>>>>        - Interval tree for efficient lookup of affected ranges during 
>>>>>>>>> invalidation
>>>>>>>>>        - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>>
>>>>>>>> That won't work unless you also modify hmm_range_fault() to take 
>>>>>>>> multiple VA addresses (or ranges) at the same time.
>>>>>>>>
>>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect 
>>>>>>>> changes to the page tables in question, but that in turn works only if 
>>>>>>>> you have one hmm_range structure and not multiple.
>>>>>>>>
>>>>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq 
>>>>>>>> you have, but that is a bit flaky.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier 
>>>>>>>>> review
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>>        - Each CPU VA range gets its own mmu_interval_notifier for 
>>>>>>>>> invalidation
>>>>>>>>>        - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>>>        - Single kgd_mem object with array of user_range_info 
>>>>>>>>> structures
>>>>>>>>>        - Unified eviction/restore path for all ranges in a batch
>>>>>>>>>
>>>>>>>>> Current Implementation Approach
>>>>>>>>> ===============================
>>>>>>>>>
>>>>>>>>> This series implements a practical solution within existing kernel 
>>>>>>>>> constraints:
>>>>>>>>>
>>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>>>        entire range from lowest to highest address in the batch
>>>>>>>>>
>>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>>>        which specific ranges are affected during invalidation 
>>>>>>>>> callbacks,
>>>>>>>>>        avoiding unnecessary processing for unrelated address changes
>>>>>>>>>
>>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>>>        restore paths, maintaining consistency with existing userptr 
>>>>>>>>> handling
>>>>>>>>>
>>>>>>>>> Patch Series Overview
>>>>>>>>> =====================
>>>>>>>>>
>>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>>>         - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>>>         - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data 
>>>>>>>>> structures
>>>>>>>>>
>>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>         - user_range_info structure for per-range tracking
>>>>>>>>>         - Fields for batch allocation in kgd_mem
>>>>>>>>>
>>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>>>         - Interval tree for efficient range lookup during invalidation
>>>>>>>>>         - mark_invalid_ranges() function
>>>>>>>>>
>>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>>>         - Single notifier for entire VA span
>>>>>>>>>         - Invalidation callback using interval tree filtering
>>>>>>>>>
>>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>>>         - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>>>         - Per-range page array management
>>>>>>>>>
>>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>>>         - init_user_pages_batch() main initialization
>>>>>>>>>         - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>>
>>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>>>         - Shared eviction/restore handling for batch allocations
>>>>>>>>>         - Integration with existing userptr validation flows
>>>>>>>>>
>>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>>>         - Input validation and range array parsing
>>>>>>>>>         - Integration with existing alloc_memory_of_gpu path
>>>>>>>>>
>>>>>>>>> Testing
>>>>>>>>> =======
>>>>>>>>>
>>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>>
>>>>>>>>> Thank you for your review and feedback.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Honglei Huang
>>>>>>>>>
>>>>>>>>> Honglei Huang (8):
>>>>>>>>>       drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>>>       drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>       drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>>>       drm/amdkfd: Add batch MMU notifier support
>>>>>>>>>       drm/amdkfd: Implement batch userptr page management
>>>>>>>>>       drm/amdkfd: Add batch allocation function and export API
>>>>>>>>>       drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>>>       drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>>
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>>>>>>>>>      .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 
>>>>>>>>> +++++++++++++++++-
>>>>>>>>>      drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>>>>>>>>>      include/uapi/linux/kfd_ioctl.h                |  31 +-
>>>>>>>>>      4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 
