On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
> On 6/16/26 3:25 PM, Gavin Shan wrote:
> > All ram device regions were made indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to a hung guest when an NVIDIA GH100 GPU is passed through from the host.
> > The memory in its PCI BAR#4 can be used as a DMA target buffer. qemu has
> > to take the DMA bounce buffer in address_space_map() to cover the DMA
> > request. However, the bounce buffer size is 4096 bytes and we overrun it
> > easily when the guest has significant disk activity while compiling
> > 'cuda-samples'. The full log and problem description can be found in
> > PATCH[1/2]'s commit log.
> > 
> > Try to fix the issue handled in commit 4a2e242bbb by replacing memcpy()/
> > memmove() with newly added helpers qemu_ram_{copy, move}() that work on
> > top of __builtin_{memcpy, memmove} or unaligned access friendly memory
> > movement in the accessors to the ram device regions. With this, we can
> > basically revert that commit to make ram device region directly accessible
> > again and bypass the bounce buffer in address_space_map() where the guest
> > hang is caused.
> > 
> > PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> > PATCH[2] makes ram device region directly accessible again
> > 
> Michael asked to include the context below in the v3 cover letter, but I
> didn't notice that before I sent the v3 series, so I'm appending it here.
> 
> ----
> 
> The issues listed by Michael:
> 
> 1. On x86, memcpy is different from __builtin_memcpy if one uses old 1.0
>    fortify-headers from 2019. Likely no longer relevant.
> 
> 2. variable length memcpy can translate 2,4,8 byte guest access into
>    multiple byte accesses. doing this for mmio is guaranteed to break devices.
> 
> 3. (theoretical concern) also on x86, unaligned accesses are possible on guest
>    and host, so converting an unaligned access to a series of aligned ones can
>    in theory break devices.
> 
> 4. also on x86, vector instructions for large (>16 byte) writes into
>    pgprot_noncached memory are safe and faster than multiple 8 byte ones.
> 
> 5. also on x86 it so happens that if you write a fixed-size memcpy this gets
>    optimized to a single store/load and it works for aligned and unaligned
>    addresses on that architecture. How to ensure this keeps being correct
>    is left as an exercise for the reader. But qemu already relies on this
>    and has done so for years.
> 
> 6. on non-x86 both unaligned accesses and vector instructions for accessing
>    UC memory are illegal.
> 
> 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86 guest can
>    map the memory as pgprot_noncached/ioremap or
>    pgprot_writecombine/ioremap_uc.
>    If it does the second then it can use unaligned or vector for access.
>    This is why normal passthrough tends to work - it never traps to qemu at
>    all. But for qemu, vfio uses pgprot_noncached unconditionally so qemu
>    can't use unaligned or vector instructions on non-x86.
> 
> 
> 8. But for nvgrace RAM, vfio has a driver that uses
>    pgprot_writecombine/ioremap_uc.
>    so qemu could safely use unaligned/vector instructions even on non-x86.
> 
> 9. Except sadly, vfio currently does not tell qemu how it maps
>    the memory, so qemu can not know what is safe on non-x86.
> 

And more:

10. on x86 memcpy will sometimes do multiple overlapping stores when
size is not a power of 2. for example, a 15 byte write is done with
two 8-byte stores. This is theoretically an issue
if guest does something super clever with ordering,
but it does not seem to be one in practice.

11. on non-x86 memcpy will do multiple overlapping stores even
for single byte writes. E.g. it does it to avoid extra branches.
This is causing issues in practice.
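
To make the fixed-size memcpy point (item 5) concrete, here is a minimal
sketch. It is not the actual qemu accessor code, and the names sketch_load()
and sketch_store() are made up for illustration. Each case passes a
compile-time constant size, which compilers commonly lower to a single load
or store on x86, though nothing in the C standard guarantees that. It also
avoids casting a possibly unaligned pointer to a wider integer type.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers (not qemu API): access a 1/2/4/8-byte value in host
 * byte order at a possibly unaligned address. Every memcpy has a constant
 * size, so no variable-length copy loop is involved. Whether the compiler
 * emits exactly one load/store is an assumption, not a guarantee.
 */
static inline int sketch_load(const void *src, unsigned size, uint64_t *val)
{
    uint8_t v8;
    uint16_t v16;
    uint32_t v32;
    uint64_t v64;

    switch (size) {
    case 1: memcpy(&v8, src, 1);  *val = v8;  return 0;
    case 2: memcpy(&v16, src, 2); *val = v16; return 0;
    case 4: memcpy(&v32, src, 4); *val = v32; return 0;
    case 8: memcpy(&v64, src, 8); *val = v64; return 0;
    default: return -1; /* other widths are rejected, not split up */
    }
}

static inline int sketch_store(void *dst, unsigned size, uint64_t val)
{
    uint8_t v8 = (uint8_t)val;
    uint16_t v16 = (uint16_t)val;
    uint32_t v32 = (uint32_t)val;

    switch (size) {
    case 1: memcpy(dst, &v8, 1);  return 0;
    case 2: memcpy(dst, &v16, 2); return 0;
    case 4: memcpy(dst, &v32, 4); return 0;
    case 8: memcpy(dst, &val, 8); return 0;
    default: return -1;
    }
}
```

For example, sketch_store(buf + 1, 4, 0x11223344) followed by
sketch_load(buf + 1, 4, &v) round-trips the value and leaves the bytes on
either side of the 4-byte window untouched.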





> Now, what is to be done?
> 
> 
> A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> At least for aligned, preferably for unaligned accesses too.
> Fixed width memcpy seems to work for this. Whether we should bother with
> __builtin to work around broken old fortify headers, I dunno.
> I do not have any answer how to check that compiler does this correctly.
> If anyone is motivated enough, adding a GCC builtin could be possible.
> Given qemu did this for years, I think we can leave solving this for
> another day.
> 
> B. Also on x86, I do not see why we should not use memcpy for large
> accesses if we can. Better perf.
> 
> C. on non-x86, we currently must not memcpy since we do not know if it
> is pgprot_noncached. yes, performance will be bad for DMA into device RAM.
> 
> D. It goes without saying that casting an unaligned address to uint32_t *
> (be it for qatomic_set or whatever) is undefined behaviour in C
> and so a bad idea on any architecture.
> 
> E. also for non-x86, we really should teach vfio to tell qemu whether
> it maps device pgprot_noncached or pgprot_writecombine.
> we will then be able to use memcpy for >8 byte accesses.
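
For point C, one way to avoid memcpy on non-x86 could look like the sketch
below. This is an illustration under assumptions, not what the series
implements: sketch_copy_to_device() is a hypothetical name, and it assumes
the destination tolerates naturally aligned 1/2/4/8-byte stores. It issues
only aligned stores and never writes the same byte twice, at the cost of
performance for large copies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not qemu API): copy len bytes from ordinary RAM to a
 * device mapping using only naturally aligned, non-overlapping stores.
 * Assumes a build with -fno-strict-aliasing, or a destination without a
 * declared type, because dst is written through differently sized lvalues.
 */
static void sketch_copy_to_device(volatile void *dst, const void *src,
                                  size_t len)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    while (len > 0) {
        size_t chunk = 8;

        /* Shrink until the chunk fits in len and d is aligned to it. */
        while (chunk > len || ((uintptr_t)d & (chunk - 1))) {
            chunk >>= 1;
        }

        /* The source is plain RAM, so a fixed-size memcpy read is fine. */
        switch (chunk) {
        case 8: {
            uint64_t v;
            memcpy(&v, s, 8);
            *(volatile uint64_t *)d = v;
            break;
        }
        case 4: {
            uint32_t v;
            memcpy(&v, s, 4);
            *(volatile uint32_t *)d = v;
            break;
        }
        case 2: {
            uint16_t v;
            memcpy(&v, s, 2);
            *(volatile uint16_t *)d = v;
            break;
        }
        default:
            *d = *s;
            break;
        }

        d += chunk;
        s += chunk;
        len -= chunk;
    }
}
```

With a destination at offset 1 of an 8-byte aligned buffer and len 15, this
issues stores of 1, 2, 4 and 8 bytes in that order, so no byte is written
twice.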
> 
> Anyone, correct me if I'm wrong? Maybe I should start a new thread with
> this summary?
> 
> Thanks,
> Gavin

