On 2026/1/15 17:20, Akihiko Odaki wrote:
On 2026/01/15 16:58, Honglei Huang wrote:
From: Honglei Huang <[email protected]>
Hello,
This series adds virtio-gpu userptr support to enable ROCm native
context for compute workloads. The userptr feature allows the host to
directly access guest userspace memory without memcpy overhead, which is
essential for GPU compute performance.
The userptr implementation provides buffer-based zero-copy memory access.
This approach pins guest userspace pages and exposes them to the host
via scatter-gather tables, enabling efficient compute operations.
This description looks identical to what
VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST does, so there should be some
explanation of how it differs.
I already pointed this out when reviewing the QEMU patches[1], but
I note it here too, since QEMU is just a middleman and this matter is
better discussed by Linux and virglrenderer developers.
[1] https://lore.kernel.org/qemu-devel/35a8add7-da49-4833-9e69-
[email protected]/
Thanks for raising this important point about the distinction between
VIRTGPU_BLOB_FLAG_USE_USERPTR and VIRTIO_GPU_BLOB_MEM_HOST3D_GUEST.
I might not have explained it clearly previously.
The key difference is memory ownership and lifecycle:
BLOB_MEM_HOST3D_GUEST:
- Kernel allocates memory (drm_gem_shmem_create)
- Userspace accesses via mmap(GEM_BO)
- Use case: Graphics resources (Vulkan/OpenGL)
BLOB_FLAG_USE_USERPTR:
- Userspace pre-allocates memory (malloc/mmap)
- Kernel only pins the existing pages
- Use case: Compute workloads (ROCm/CUDA) with large datasets. For
  example, when the GPU needs to load a big model file (10 GB+), the UMD
  mmaps the file and passes the resulting userspace pointer to the
  driver, so no additional copy is needed.
If shmem is used instead, userspace has to copy the file data into the
shmem mapping, which adds copy overhead.
Userptr:
file -> open/mmap -> userspace ptr -> driver
shmem:
user alloc shmem ──→ mmap shmem ──→ shmem userspace ptr -> driver
                                              ↑
                                              │ copy
                                              │
file ──→ open/mmap ──→ file userptr ──────────┘
For compute workloads, this matters significantly:
Without userptr: malloc(8GB) → alloc GEM BO → memcpy 8GB → compute →
memcpy 8GB back
With userptr: malloc(8GB) → create userptr BO → compute (zero-copy)
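To make the two paths concrete, here is a minimal userspace sketch in C.
It is not code from this series: map_file_readonly() and
copy_into_staging() are hypothetical helper names, and the anonymous
shared mapping only stands in for an mmap'ed GEM BO. The first helper
yields the pointer that a userptr BO could wrap directly; the second
shows the extra allocation and memcpy that the shmem path needs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Userptr-style path: map the file; the mapping itself is what would be
 * handed to the driver. Returns NULL on failure or for an empty file. */
static void *map_file_readonly(const char *path, size_t *len_out)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}

/* shmem-style path: allocate a second buffer (a stand-in for the mmap'ed
 * GEM BO, an assumption of this sketch) and copy the data into it. */
static void *copy_into_staging(const void *src, size_t len)
{
    void *dst;

    assert(src != NULL && len > 0);
    dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED)
        return NULL;
    memcpy(dst, src, len); /* this is the copy that userptr avoids */
    return dst;
}
```

With userptr only the first helper is needed; with shmem both run, and
the memcpy cost grows with the dataset size.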
The explicit flag serves three purposes:
1. Although both send scatter-gather entries to the host, the flag makes
the intent unambiguous.
2. It ensures consistency between the flag and the userptr address field.
3. Future HMM support: There is a plan to upgrade the userptr
implementation to use Heterogeneous Memory Management for better GPU
coherency and dynamic page migration. The flag provides a clean path for
that future upgrade.
I understand the concern about API complexity. I'll defer to the
virtio-gpu maintainers for the final decision on whether this design is
acceptable or if they prefer an alternative approach.
Regards,
Honglei Huang
Key features:
- Zero-copy memory access between guest userspace and host GPU
- Read-only and read-write userptr support
- Runtime feature detection via VIRTGPU_PARAM_RESOURCE_USERPTR
- ROCm capset support for ROCm stack integration
- Proper page lifecycle management with FOLL_LONGTERM pinning
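As a rough illustration of what pinning a userptr range involves, the
number of pages to pin and describe in the scatter-gather table follows
from the start address and the size. This sketch assumes 4 KiB pages and
uses hypothetical names (DEMO_PAGE_SHIFT, demo_userptr_num_pages); it is
not taken from virtgpu_userptr.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Real code would use the kernel's PAGE_SHIFT. */
#define DEMO_PAGE_SHIFT 12

/* Count the pages touched by [addr, addr + size). The range does not
 * have to be page aligned, so a short range can still span two pages. */
static uint64_t demo_userptr_num_pages(uint64_t addr, uint64_t size)
{
    uint64_t first, last;

    /* Require a non-empty range that does not wrap around. */
    assert(size > 0 && addr <= UINT64_MAX - (size - 1));

    first = addr >> DEMO_PAGE_SHIFT;
    last = (addr + size - 1) >> DEMO_PAGE_SHIFT;
    return last - first + 1;
}
```

For a page-aligned 8 GiB buffer this gives 2097152 pages, which is why
long-term pinning of such ranges needs careful lifecycle handling.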
Patches overview:
1. Add VIRTIO_GPU_CAPSET_ROCM capability for compute workloads
2. Add virtio-gpu API definitions for userptr blob resources
3. Extend DRM UAPI with comprehensive userptr support
4. Implement core userptr functionality with page management
5. Integrate userptr into blob resource creation and advertise to
userspace
Performance: In popular compute benchmarks, this implementation achieves
approximately 70% of bare-metal OpenCL performance on AMD V2000 hardware
and 92% on AMD W7900 hardware.
Testing: Verified with the ROCm stack and OpenCL applications in VIRTIO
virtualized environments.
- Full OpenCL CTS passed on ROCm 5.7.0 on the V2000 platform.
- Nearly 70% of OpenCL CTS tests passed on ROCm 7.0 on the W7900
platform.
- Most HIP catch tests passed on ROCm 7.0 on the W7900 platform.
- Some AI applications enabled on ROCm 7.0 W7900 platform.
V4 changes:
- Renamed VIRTIO_GPU_CAPSET_HSAKMT to VIRTIO_GPU_CAPSET_ROCM
- Removed userptr feature probing because it can reuse the guest
blob resource code path; reduced patch count from 6 to 5
- Updated corresponding commit messages
- Consolidated userptr feature detection in final patch
- Update corresponding cover letter content
V3 changes:
- Split into focused patches for easier review
- Removed complex interval tree userptr management
- Simplified resource creation without deduplication
- Added VIRTGPU_PARAM_RESOURCE_USERPTR for feature detection
- Improved UAPI documentation and error handling
- Enhanced code quality with proper cleanup paths
- Removed MMU notifier dependencies for simplicity
- Fixed resource lifecycle management issues
V2 changes:
- Split adding the HSAKMT context and the blob userptr resource into two
patches.
- Removed MMU notifier related patches, because using non-movable user
space memory with an MMU notifier is not a good idea.
- Removed the HSAKMT context check at context creation, so all contexts
support the userptr feature.
- Removed MMU notifier related content from the cover letter.
- Added more comments for patch 6 in the cover letter.
Honglei Huang (5):
drm/virtio-gpu: Add VIRTIO_GPU_CAPSET_ROCM capability
virtio-gpu api: add blob userptr resource
drm/virtgpu api: add blob userptr resource
drm/virtio: implement userptr support for zero-copy memory access
drm/virtio: advertise base userptr feature to userspace
drivers/gpu/drm/virtio/Makefile | 3 +-
drivers/gpu/drm/virtio/virtgpu_drv.h | 33 ++++
drivers/gpu/drm/virtio/virtgpu_ioctl.c | 9 +-
drivers/gpu/drm/virtio/virtgpu_object.c | 6 +
drivers/gpu/drm/virtio/virtgpu_userptr.c | 231 +++++++++++++++++++++++
include/uapi/drm/virtgpu_drm.h | 9 +
include/uapi/linux/virtio_gpu.h | 7 +
7 files changed, 295 insertions(+), 3 deletions(-)
create mode 100644 drivers/gpu/drm/virtio/virtgpu_userptr.c