On 6/12/26 04:03, Gavin Shan wrote:
All ram device regions were made indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to a guest hang when compiling 'cuda-samples', as reported by Julia. The
guest is started with the following command lines, with a GH100 GPU card.

  host$ lspci | grep GH100
  0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
  host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
        -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
        -accel kvm -cpu host -smp cpus=48 -m size=8G \
        -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
        -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
     :
  guest$ cd cuda-samples/build
  guest$ make -j 20 clean
  guest$ make -j 20
     :
  [ 54%] Linking CUDA executable graphMemoryNodes
  [ 54%] Built target graphMemoryNodes
  <no more output afterwards, guest becomes frozen here>
  guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
  [ 555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)

When the GPU's driver (NVIDIA open driver) is loaded on guest bootup, the
memory blocks residing in PCI BAR#4 can be presented to the guest through
memory hot-add. The page cache can be allocated from the hot-added memory
blocks while cuda-samples is being compiled. Afterwards, the page cache is
sent to QEMU's virtio-blk device as part of the DMA request. The bounce
buffer has to be used to accommodate the request, because the corresponding
memory region (MemoryRegion) is a RAM DEVICE region and only indirectly
accessible in QEMU. However, the max bounce buffer size is only 4096 bytes
by default, so we run out of that space quickly.

  QEMU
  ====
  virtio_blk_handle_output
    virtio_blk_handle_vq
      virtio_blk_get_request
        virtqueue_pop
          virtqueue_split_pop
            virtqueue_map_desc
              address_space_map
                memory_access_is_direct          # Return false
                  memory_region_supports_direct_access

  (qemu) info mtree
  memory-region: pci_bridge_pci
    0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
      0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
        0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
          0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]

This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
accessors to ram device memory region, preparatory work to make ram device
region directly accessible and bypass the bounce buffer in the DMA path in
next patch.

memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later decides whether or not to expand inline.
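
A minimal sketch of this point, assuming GCC or Clang in a hosted build
without -fno-builtin or -ffreestanding. The function names are hypothetical
and not taken from QEMU. Both spellings name the same builtin, so the
compiler is free to expand either one inline or to emit a libc call:

```c
#include <stdint.h>
#include <string.h>

/* Copy 8 bytes using the libc name. */
uint64_t load_with_memcpy(const void *src)
{
    uint64_t val;

    memcpy(&val, src, sizeof(val));
    return val;
}

/* Copy 8 bytes using the explicit builtin (a GCC/Clang extension). */
uint64_t load_with_builtin(const void *src)
{
    uint64_t val;

    __builtin_memcpy(&val, src, sizeof(val));
    return val;
}

/* Hypothetical self-check: returns 1 when both loads see the same value. */
int loads_agree(void)
{
    const unsigned char bytes[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

    return load_with_memcpy(bytes) == load_with_builtin(bytes);
}
```

Building this with "gcc -O2 -S" and comparing the assembly of the two load
functions is one way to check whether the spelling makes any difference for
a given compiler and target.
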
So, this doesn't do what you think it does.

My real question is: what are you attempting to achieve?

(1) is the problem unaligned access to a mapped physical device?
(2) is the problem vector access to a mapped physical device?
(3) something else?


r~
