On Fri, Jun 12, 2026 at 08:29:34AM -0700, Richard Henderson wrote:
> On 6/12/26 04:03, Gavin Shan wrote:
> > All ram device regions were made indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to guest hang on compiling 'cuda-samples' as reported by Julia. The guest
> > is started by the following command lines, with a GH100 GPU card.
> > 
> >     host$ lspci | grep GH100
> >     0009:01:00.0 3D controller: NVIDIA Corporation GH100 [GH200 120GB / 480GB] (rev a1)
> >     host$ /home/sandbox/gavin/qemu.main/build/qemu-system-aarch64 \
> >           -machine virt,gic-version=host,ras=on,highmem-mmio-size=4T \
> >           -accel kvm -cpu host -smp cpus=48 -m size=8G \
> >           -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=d0 \
> >           -device virtio-blk-pci,id=vb0,bus=pcie.0,drive=d0,num-queues=4 \
> >           -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.1.0
> >             :
> >     guest$ cd cuda-samples/build
> >     guest$ make -j 20 clean
> >     guest$ make -j 20
> >             :
> >     [ 54%] Linking CUDA executable graphMemoryNodes
> >     [ 54%] Built target graphMemoryNodes
> >     <no more output afterwards, guest becomes frozen here>
> > 
> >     guest$ qemu-system-aarch64: virtio: bogus descriptor or out of resources
> >     [  555.814025] virtio_blk virtio0: [vda] new size: 268435456 512-byte logical blocks (137 GB/128 GiB)
> > 
> > When the GPU's driver (NVidia open driver) is loaded on guest bootup,
> > the memory blocks residing in the PCI BAR#4 can be presented to the
> > guest through memory hot-add. The page cache can be allocated from the
> > hot added memory blocks when cuda-samples is being compiled. Afterwards,
> > the page cache is sent to QEMU's virtio-blk device as part of the DMA
> > request, the bounce buffer has to be used to accommodate the request as
> > the corresponding memory region (MemoryRegion) is a RAM DEVICE region
> > and indirectly accessible in qemu. However, the max bounce buffer size
> > is only 4096 bytes by default. We're running out of that space quickly.
> > 
> >    QEMU
> >    ====
> >    virtio_blk_handle_output
> >      virtio_blk_handle_vq
> >        virtio_blk_get_request
> >          virtqueue_pop
> >            virtqueue_split_pop
> >              virtqueue_map_desc
> >                address_space_map
> >                  memory_access_is_direct         # Return false
> >                    memory_region_supports_direct_access
> > 
> >    (qemu) info mtree
> >    memory-region: pci_bridge_pci
> >      0000000000000000-ffffffffffffffff (prio 0, container): pci_bridge_pci
> >        0000042000000000-0000043fffffffff (prio 1, i/o): 0009:01:00.0 base BAR 4
> >          0000042000000000-0000043fffffffff (prio 0, i/o): 0009:01:00.0 BAR 4
> >            0000042000000000-000004379fffffff (prio 0, ramd): 0009:01:00.0 BAR 4 mmaps[0]
> > 
> > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > accessors to ram device memory region, preparatory work to make ram device
> > region directly accessible and bypass the bounce buffer in the DMA path
> > in next patch.
> 
> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the
> compiler later decides whether or not to expand inline.

Did some research.

The issue that prompted the use of __builtin_memcpy in bswap.h is musl's
fortify headers, which Alpine Linux was using back in 2019.

Try this:
git clone https://github.com/jvoisin/fortify-headers /tmp/fortify-headers
git -C /tmp/fortify-headers checkout 5aabc7e~1

Now:

musl-gcc -O2 -D_FORTIFY_SOURCE=2 -isystem /tmp/fortify-headers/include -S -o- test.c

where test.c is:

#include <string.h>
#include <stdint.h>

void test_memcpy(void *dst, const void *src) {
    memcpy(dst, src, 4);
}

void test_builtin_memcpy(void *dst, const void *src) {
    __builtin_memcpy(dst, src, 4);
}




On x86:

    test_memcpy:
     .LFB13:
        .cfi_startproc
        cmpq    %rsi, %rdi
        jnb     .L2
        leaq    4(%rdi), %rax
        cmpq    %rax, %rsi
        jb      .L6
     .L4:
        movl    $4, %edx
        jmp     memcpy
        .p2align 4,,10
        .p2align 3
     .L2:
        cmpq    %rdi, %rsi
        jnb     .L4
        leaq    4(%rsi), %rax
        cmpq    %rax, %rdi
        jb      .L3
        movl    $4, %edx
        jmp     memcpy
     .L6:
        jmp     .L3
        .cfi_endproc
        .section        .text.unlikely
        .cfi_startproc
        .type   test_memcpy.cold, @function
     test_memcpy.cold:
     .LFSB13:
     .L3:
        ud2
        .cfi_endproc
     .LFE13:
        .text
        .size   test_memcpy, .-test_memcpy



That ends in a jmp to memcpy, so it is a real function call (a tail call).
With the builtin:



     test_builtin_memcpy:
     .LFB14:
        .cfi_startproc
        movl    (%rsi), %eax
        movl    %eax, (%rdi)
        ret
        .cfi_endproc
     .LFE14:
        .size   test_builtin_memcpy, .-test_builtin_memcpy



So it's a fortify headers Version 1.0 bug.  Version 1.1 has the fix.
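
As a rough illustration of why the fortified call is not inlined, here is a
simplified sketch inferred from the assembly above. It is not the actual
fortify-headers source. The names out_of_line_memcpy, checked_memcpy and
copy_u32_checked are hypothetical, and the asm alias assumes GCC or Clang on
an ELF target such as Linux. The wrapper does an overlap check and then calls
libc's memcpy under a name the compiler does not treat as a builtin:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical alias: reach libc's memcpy under a different C name, so the
 * compiler no longer recognizes the call as the memcpy builtin.
 * Assumption: GCC or Clang on an ELF target such as Linux. */
extern void *out_of_line_memcpy(void *dst, const void *src, size_t n)
    __asm__("memcpy");

/* Hypothetical wrapper: trap on overlapping buffers (dst == src is allowed),
 * then make a real, out-of-line call. */
static inline void *checked_memcpy(void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if ((d < s && d + n > s) || (s < d && s + n > d)) {
        __builtin_trap();
    }
    return out_of_line_memcpy(dst, src, n);
}

/* Copies 4 bytes through the wrapper, similar in shape to test_memcpy. */
uint32_t copy_u32_checked(const void *src)
{
    uint32_t v;

    checked_memcpy(&v, src, sizeof(v));
    return v;
}
```

Swapping out_of_line_memcpy for __builtin_memcpy in the wrapper lets the
compiler expand the fixed-size copy inline again.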

I guess we should rewrite bswap.h then.
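
A minimal sketch of what such a rewrite could look like, assuming the helpers
just switch to plain memcpy. The names load_u32_he, store_u32_he and
roundtrip_unaligned are hypothetical and are not the actual bswap.h helpers:

```c
#include <stdint.h>
#include <string.h>

/* Host-endian 32-bit load from a possibly unaligned pointer, written with
 * plain memcpy rather than __builtin_memcpy. */
static inline uint32_t load_u32_he(const void *ptr)
{
    uint32_t r;

    memcpy(&r, ptr, sizeof(r));
    return r;
}

/* Host-endian 32-bit store to a possibly unaligned pointer. */
static inline void store_u32_he(void *ptr, uint32_t v)
{
    memcpy(ptr, &v, sizeof(v));
}

/* Round-trip a value through an odd (unaligned) offset in a byte buffer. */
uint32_t roundtrip_unaligned(uint32_t v)
{
    unsigned char buf[8] = {0};

    store_u32_he(buf + 1, v);
    return load_u32_he(buf + 1);
}
```

With a fixed fortify-headers (or none at all) the compiler is free to expand
these fixed-size copies into single loads and stores.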


> So, this doesn't do what you think it does.
> My real question is: what are you attempting to achieve?
> 
> (1) is the problem unaligned access to a mapped physical device?
> (2) is the problem vector access to a mapped physical device?
> (3) something else?
> 
> 
> r~

