On 6/12/26 04:03, Gavin Shan wrote:
All ram device regions were made indirectly accessible by commit
4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
to a guest hang when compiling 'cuda-samples', as reported by Julia. The
guest is started with the following command line, with a GH100 GPU card.

    host$ lspci | grep GH100
    0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
    host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64            \
          -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T         \
          -accel kvm -cpu host -smp cpus=48 -m size=8G                       \
          -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0    \
          -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4     \
          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
            :
    guest$ cd cuda-samples/build
    guest$ make -j 20 clean
    guest$ make -j 20
            :
    [ 54%] Linking CUDA executable graphMemoryNodes
    [ 54%] Built target graphMemoryNodes
    <no more output afterwards, guest becomes frozen here>

    guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
    [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)

When the GPU's driver (NVIDIA open driver) is loaded on guest bootup,
the memory blocks residing in the PCI BAR#4 can be presented to the
guest through memory hot-add. The page cache can be allocated from the
hot-added memory blocks when cuda-samples is being compiled. Afterwards,
the page cache is sent to QEMU's virtio-blk device as part of the DMA
request. The bounce buffer has to be used to accommodate the request
because the corresponding memory region (MemoryRegion) is a RAM DEVICE
region and only indirectly accessible in QEMU. However, the maximum
bounce buffer size is only 4096 bytes by default, so we run out of that
space quickly.
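
The exhaustion described above can be illustrated with a toy model. This
is only a sketch, not QEMU's implementation: BounceBudget, bounce_reserve
and bounce_release are hypothetical names, and the only figure taken from
the description is the 4096-byte default limit. Once one page-sized
request holds the whole budget, the next indirect mapping is granted
nothing until the first one is released.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bounce buffer budget (not a QEMU type). */
typedef struct {
    size_t max_size;  /* 4096 bytes by default, per the description */
    size_t used;      /* bytes currently handed out */
} BounceBudget;

/* Reserve up to 'want' bytes; returns how many bytes were granted. */
size_t bounce_reserve(BounceBudget *b, size_t want)
{
    size_t avail = b->max_size - b->used;
    size_t granted = want < avail ? want : avail;

    b->used += granted;
    return granted;
}

/* Give back 'len' bytes that were previously granted. */
void bounce_release(BounceBudget *b, size_t len)
{
    assert(len <= b->used);
    b->used -= len;
}
```

With a budget of 4096 bytes, a first 4096-byte reservation succeeds in
full and a second concurrent one gets 0 bytes, which is the situation a
caller sees as a failed mapping.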

   QEMU
   ====
   virtio_blk_handle_output
     virtio_blk_handle_vq
       virtio_blk_get_request
         virtqueue_pop
           virtqueue_split_pop
             virtqueue_map_desc
               address_space_map
                 memory_access_is_direct         # Return false
                   memory_region_supports_direct_access

   (qemu) info mtree
   memory-region: pci_bridge_pci
     0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
       0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
         0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
           0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
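
The trace and memory tree above amount to a simple decision, sketched
here as a toy model. ToyRegion and toy_supports_direct_access are
hypothetical names used for illustration only, and the real check in
QEMU covers more cases. The point is that a "ramd" region is RAM-backed
yet still excluded from direct access, so the mapping falls back to the
bounce buffer.

```c
#include <stdbool.h>

/* Toy model of a memory region; field names are hypothetical. */
typedef struct {
    bool is_ram;         /* backed by host memory */
    bool is_ram_device;  /* host mapping of device memory ("ramd") */
} ToyRegion;

/* In this model, direct (pointer) access is allowed only for plain RAM. */
bool toy_supports_direct_access(const ToyRegion *mr)
{
    if (!mr->is_ram) {
        return false;
    }
    return !mr->is_ram_device;
}
```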

This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
accessors for ram device memory regions, as preparatory work to make ram
device regions directly accessible and bypass the bounce buffer in the
DMA path in the next patch.

memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler later decides whether or not to expand inline.
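
A minimal illustration, assuming GCC or Clang with builtins enabled (no
-fno-builtin or -ffreestanding). The helper names copy_plain and
copy_builtin are hypothetical, not QEMU code. Comparing the output of
"cc -O2 -S" for the two functions would be one way to check that they
are treated alike on a given toolchain.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the ordinary library spelling. */
void copy_plain(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Hypothetical helper: the explicit builtin spelling. */
void copy_builtin(void *dst, const void *src, size_t n)
{
    __builtin_memcpy(dst, src, n);
}
```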

So, this doesn't do what you think it does.
My real question is: what are you attempting to achieve?

(1) is the problem unaligned access to a mapped physical device?
(2) is the problem vector access to a mapped physical device?
(3) something else?
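
To make (1) and (2) concrete, here is a sketch of a copy routine that
avoids both. It is purely illustrative: copy_from_device is a
hypothetical name, not an existing QEMU function, and whether this is
what the device needs is exactly the open question. It uses byte loads
until the source is 8-byte aligned, then naturally aligned 64-bit loads,
and volatile asks the compiler to perform each load as written instead
of merging them into wider accesses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n bytes out of a device mapping. */
void copy_from_device(void *dst, const volatile void *src, size_t n)
{
    uint8_t *d = dst;
    const volatile uint8_t *s = src;

    /* Byte loads until the source pointer is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)s & 7) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Naturally aligned 64-bit loads from the device side. */
    while (n >= 8) {
        uint64_t v = *(const volatile uint64_t *)s;

        /* dst is ordinary RAM, so any store width is acceptable here. */
        __builtin_memcpy(d, &v, sizeof(v));
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Byte loads for the remaining tail. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```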


r~
