On 2/9/26 07:14, Honglei Huang wrote:
>
> I've reworked the implementation in v4. The fix is actually inspired
> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>
> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
> multiple user virtual address ranges under a single mmu_interval_notifier,
> and these ranges can be non-contiguous, which is essentially the same
> problem that batch userptr needs to solve: one BO backed by multiple
> non-contiguous CPU VA ranges sharing one notifier.
That still doesn't solve the sequencing problem.

As far as I can see you can't use hmm_range_fault() with this approach,
or it would just not be very valuable.

So how should that work with your patch set?

Regards,
Christian.

>
> The wide notifier is created in drm_gpusvm_notifier_alloc():
>
>     notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>     notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>
> The Xe driver passes xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init()
> as the notifier_size, so one notifier can cover many MB of VA space
> containing multiple non-contiguous ranges.
>
> And DRM GPU SVM solves the per-range validity problem with flag-based
> validation instead of seq-based validation:
>
> - drm_gpusvm_pages_valid() checks flags.has_dma_mapping, not
>   notifier_seq. The comment explicitly states:
>   "This is akin to a notifier seqno check in the HMM documentation
>   but due to wider notifiers (i.e., notifiers which span multiple
>   ranges) this function is required for finer grained checking"
> - __drm_gpusvm_unmap_pages() clears flags.has_dma_mapping = false
>   under notifier_lock
> - drm_gpusvm_get_pages() sets flags.has_dma_mapping = true
>   under notifier_lock
>
> I adopted the same approach.
>
> DRM GPU SVM:
>   drm_gpusvm_notifier_invalidate()
>     down_write(&gpusvm->notifier_lock);
>     mmu_interval_set_seq(mni, cur_seq);
>     gpusvm->ops->invalidate()
>       -> xe_svm_invalidate()
>            drm_gpusvm_for_each_range()
>              -> __drm_gpusvm_unmap_pages()
>                   WRITE_ONCE(flags.has_dma_mapping, false);  // clear flag
>     up_write(&gpusvm->notifier_lock);
>
> KFD batch userptr:
>   amdgpu_amdkfd_evict_userptr_batch()
>     mutex_lock(&process_info->notifier_lock);
>     mmu_interval_set_seq(mni, cur_seq);
>     discard_invalid_ranges()
>       interval_tree_iter_first/next()
>         range_info->valid = false;  // clear flag
>     mutex_unlock(&process_info->notifier_lock);
>
> Both implementations:
> - Acquire notifier_lock FIRST, before any flag changes
> - Call mmu_interval_set_seq() under the lock
> - Use the interval tree to find affected ranges within the wide notifier
> - Mark the per-range flag as invalid/valid under the lock
>
> The page fault path and final validation path also follow the same
> pattern as DRM GPU SVM: fault outside the lock, set/check the per-range
> flag under the lock.
>
> Regards,
> Honglei
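[The invalidation pattern Honglei describes reduces to roughly the sketch
below. This is a minimal illustration, not the actual patch code: the
batch_userptr and batch_range types, their field names, and where the
lock lives are all assumptions made for the example; the real series
hangs this state off kgd_mem and amdkfd_process_info.]

    #include <linux/interval_tree.h>
    #include <linux/mmu_notifier.h>
    #include <linux/mutex.h>

    /* Illustrative types only; names and layout are assumed. */
    struct batch_range {
            struct interval_tree_node it_node; /* [start, last] of one CPU VA range */
            bool valid;                        /* the per-range flag, not a seqno */
    };

    struct batch_userptr {
            struct mmu_interval_notifier notifier; /* one wide notifier for the span */
            struct rb_root_cached range_tree;      /* the non-contiguous ranges */
            struct mutex notifier_lock;
    };

    static bool batch_userptr_invalidate(struct mmu_interval_notifier *mni,
                                         const struct mmu_notifier_range *range,
                                         unsigned long cur_seq)
    {
            struct batch_userptr *bu =
                    container_of(mni, struct batch_userptr, notifier);
            struct interval_tree_node *node;

            /* Lock first, then bump the seq, then clear flags -- the
             * ordering both flows above share. */
            mutex_lock(&bu->notifier_lock);
            mmu_interval_set_seq(mni, cur_seq);

            /* Only ranges overlapping the invalidated span are touched;
             * unrelated ranges under the same wide notifier are skipped. */
            for (node = interval_tree_iter_first(&bu->range_tree, range->start,
                                                 range->end - 1);
                 node;
                 node = interval_tree_iter_next(node, range->start,
                                                range->end - 1)) {
                    struct batch_range *br =
                            container_of(node, struct batch_range, it_node);

                    br->valid = false; /* cleared under notifier_lock */
            }

            mutex_unlock(&bu->notifier_lock);
            return true;
    }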
>
> On 2026/2/6 21:56, Christian König wrote:
>> On 2/6/26 07:25, Honglei Huang wrote:
>>> From: Honglei Huang <[email protected]>
>>>
>>> Hi all,
>>>
>>> This is v3 of the patch series to support allocating multiple
>>> non-contiguous CPU virtual address ranges that map to a single
>>> contiguous GPU virtual address.
>>>
>>> v3:
>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>    - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>
>> That is most likely not the best approach, but Felix or Philip need to
>> comment here since I don't know such IOCTLs well either.
>>
>>>    - When the flag is set, the mmap_offset field points to the range array
>>>    - Minimal API surface change
>>
>> Why a range of VA space for each entry?
>>
>>> 2. Improved MMU notifier handling:
>>>    - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>    - Interval tree for efficient lookup of affected ranges during
>>>      invalidation
>>>    - Avoids the per-range notifier overhead mentioned in the v2 review
>>
>> That won't work unless you also modify hmm_range_fault() to take
>> multiple VA addresses (or ranges) at the same time.
>>
>> The problem is that we must rely on hmm_range.notifier_seq to detect
>> changes to the page tables in question, but that in turn works only if
>> you have one hmm_range structure and not multiple.
>>
>> What might work is doing an XOR or CRC over all the
>> hmm_range.notifier_seq values you have, but that is a bit flaky.
>>
>> Regards,
>> Christian.
>>
>>>
>>> 3. Better code organization: Split into 8 focused patches for easier
>>>    review
>>>
>>> v2:
>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>> - All ranges validated together and mapped to contiguous GPU VA
>>> - Single kgd_mem object with an array of user_range_info structures
>>> - Unified eviction/restore path for all ranges in a batch
>>>
>>> Current Implementation Approach
>>> ===============================
>>>
>>> This series implements a practical solution within existing kernel
>>> constraints:
>>>
>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>    entire range from the lowest to the highest address in the batch
>>>
>>> 2. Interval tree filtering: Use an interval tree to efficiently identify
>>>    which specific ranges are affected during invalidation callbacks,
>>>    avoiding unnecessary processing for unrelated address changes
>>>
>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>    restore paths, maintaining consistency with existing userptr handling
>>>
>>> Patch Series Overview
>>> =====================
>>>
>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>
>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>> - user_range_info structure for per-range tracking
>>> - Fields for batch allocation in kgd_mem
>>>
>>> Patch 3/8: Implement interval tree for userptr ranges
>>> - Interval tree for efficient range lookup during invalidation
>>> - mark_invalid_ranges() function
>>>
>>> Patch 4/8: Add batch MMU notifier support
>>> - Single notifier for the entire VA span
>>> - Invalidation callback using interval tree filtering
>>>
>>> Patch 5/8: Implement batch userptr page management
>>> - get_user_pages_batch() and set_user_pages_batch()
>>> - Per-range page array management
>>>
>>> Patch 6/8: Add batch allocation function and export API
>>> - init_user_pages_batch() main initialization
>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>
>>> Patch 7/8: Unify userptr cleanup and update paths
>>> - Shared eviction/restore handling for batch allocations
>>> - Integration with existing userptr validation flows
>>>
>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>> - Input validation and range array parsing
>>> - Integration with existing alloc_memory_of_gpu path
>>>
>>> Testing
>>> =======
>>>
>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>> - Various allocation sizes (4KB to 1GB+ per range)
>>> - Memory pressure scenarios and eviction/restore cycles
>>> - OpenCL CTS and HIP catch tests in a KVM guest environment
>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>> - Small LLM inference (3B-7B models)
>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>> - Performance improvement: 2x-2.4x faster than the userspace approach
>>>
>>> Thank you for your review and feedback.
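[To make the sequencing question concrete: under one shared wide
notifier, per-range validation would have to give each CPU VA range its
own hmm_range and its own notifier_seq snapshot, then combine the seq
check with the per-range flag under the lock. A minimal sketch, reusing
the hypothetical batch_userptr/batch_range types from the earlier
sketch; validate_batch_ranges() and the caller-provided seqs/pfns arrays
are assumptions, and the -EBUSY retry handling of hmm_range_fault() is
elided.]

    #include <linux/hmm.h>
    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    static int validate_batch_ranges(struct batch_userptr *bu,
                                     struct mm_struct *mm,
                                     struct batch_range *ranges,
                                     unsigned long *seqs,
                                     unsigned long **pfns, int n)
    {
            int i, ret;

    retry:
            /* Fault each range outside the lock, one hmm_range apiece. */
            for (i = 0; i < n; i++) {
                    struct hmm_range hrange = {
                            .notifier = &bu->notifier, /* shared wide notifier */
                            .start = ranges[i].it_node.start,
                            .end = ranges[i].it_node.last + 1,
                            .hmm_pfns = pfns[i],
                            .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
                    };

                    if (READ_ONCE(ranges[i].valid))
                            continue; /* pages survived; the flag says so */

                    hrange.notifier_seq = mmu_interval_read_begin(&bu->notifier);
                    mmap_read_lock(mm);
                    ret = hmm_range_fault(&hrange); /* -EBUSY retry elided */
                    mmap_read_unlock(mm);
                    if (ret)
                            return ret;
                    seqs[i] = hrange.notifier_seq;
            }

            mutex_lock(&bu->notifier_lock);
            for (i = 0; i < n; i++) {
                    if (ranges[i].valid)
                            continue; /* guarded by the flag, not the seq */
                    /* An invalidation anywhere in the wide span bumps the
                     * shared seq, so a plain seq check can force unrelated
                     * ranges to refault -- Christian's objection. Checking
                     * the seq only for ranges just faulted, and trusting the
                     * flag for the rest, is the GPU-SVM-style compromise. */
                    if (mmu_interval_read_retry(&bu->notifier, seqs[i])) {
                            mutex_unlock(&bu->notifier_lock);
                            goto retry;
                    }
                    ranges[i].valid = true; /* set under notifier_lock */
            }
            mutex_unlock(&bu->notifier_lock);
            return 0;
    }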
>>>
>>> Best regards,
>>> Honglei Huang
>>>
>>> Honglei Huang (8):
>>>   drm/amdkfd: Add userptr batch allocation UAPI structures
>>>   drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>   drm/amdkfd: Implement interval tree for userptr ranges
>>>   drm/amdkfd: Add batch MMU notifier support
>>>   drm/amdkfd: Implement batch userptr page management
>>>   drm/amdkfd: Add batch allocation function and export API
>>>   drm/amdkfd: Unify userptr cleanup and update paths
>>>   drm/amdkfd: Wire up batch allocation in ioctl handler
>>>
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>>>  include/uapi/linux/kfd_ioctl.h                |  31 +-
>>>  4 files changed, 697 insertions(+), 24 deletions(-)
>>
>
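[As a closing illustration of what patches 3 and 4 of the cover letter
describe -- one notifier over the whole span plus an interval tree for
the ranges inside it -- registration could look roughly like this. The
function and type names are again assumptions carried over from the
earlier sketches, not the series' actual code.]

    #include <linux/interval_tree.h>
    #include <linux/minmax.h>
    #include <linux/mmu_notifier.h>

    /* Assumed to exist: batch_userptr/batch_range from the earlier
     * sketches and the batch_userptr_invalidate() callback. */
    static const struct mmu_interval_notifier_ops batch_userptr_notifier_ops = {
            .invalidate = batch_userptr_invalidate,
    };

    static int batch_userptr_register(struct batch_userptr *bu,
                                      struct mm_struct *mm,
                                      struct batch_range *ranges, int n)
    {
            unsigned long va_min = ranges[0].it_node.start;
            unsigned long va_max = ranges[0].it_node.last;
            int i;

            /* The span is the hull of all ranges, which can be far larger
             * than their sum when the CPU VAs are scattered. */
            for (i = 1; i < n; i++) {
                    va_min = min(va_min, ranges[i].it_node.start);
                    va_max = max(va_max, ranges[i].it_node.last);
            }

            for (i = 0; i < n; i++)
                    interval_tree_insert(&ranges[i].it_node, &bu->range_tree);

            /* One mmu_interval_notifier covering [va_min, va_max]; the
             * interval tree then filters invalidations inside it. */
            return mmu_interval_notifier_insert(&bu->notifier, mm, va_min,
                                                va_max - va_min + 1,
                                                &batch_userptr_notifier_ops);
    }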
