On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
> On 6/16/26 3:25 PM, Gavin Shan wrote:
> > All ram device regions were made indirectly accessible by commit
> > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This leads
> > to a hung guest when an NVidia GH100 GPU is passed through from the host.
> > The memory in its PCI BAR#4 can be allocated as a DMA target buffer. qemu
> > has to use the DMA bounce buffer in address_space_map() to cover the DMA
> > request. However, the bounce buffer size is 4096 bytes and we overrun it
> > easily when the guest has significant disk activity while compiling
> > 'cuda-samples'. The full log and problem description can be found in
> > PATCH[1/2]'s commit log.
> >
> > Try to fix the issue handled in commit 4a2e242bbb by replacing memcpy()/
> > memmove() with newly added helpers qemu_ram_{copy, move}() that work on
> > top of __builtin_{memcpy, memmove} or unaligned access friendly memory
> > movement in the accessors to the ram device regions. With this, we can
> > basically revert that commit to make ram device region directly accessible
> > again and bypass the bounce buffer in address_space_map() where the guest
> > hang is caused.
> >
> > PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> > PATCH[2] makes ram device region directly accessible again
> >
> Michael asked to include the context below in the cover letter in v3, but I
> didn't notice that before I sent the v3 series, so I am appending it here.
>
> ----
>
> The issues listed by Michael:
>
> 1. On x86, memcpy is different from __builtin_memcpy if one uses old 1.0
> fortify-headers from 2019. Likely no longer relevant.
>
> 2. variable length memcpy can translate 2,4,8 byte guest access into
> multiple byte accesses. doing this for mmio is guaranteed to break devices.
>
> 3. (theoretical concern) also on x86, unaligned accesses are possible on guest
> and host, so converting an unaligned access to a series of aligned ones can
> in theory break devices.
>
> 4. also on x86, vector instructions for large (>16 byte) writes into
> pgprot_noncached memory are safe and faster than multiple 8 byte ones.
>
> 5. also on x86 it so happens that if you write a fixed-size memcpy this gets
> optimized to a single store/load and it works for aligned and unaligned
> addresses on that architecture. How to ensure this keeps being correct
> is left as an exercise for the reader. But qemu already relies on this
> and did for years.
>
> 6. on non-x86 both unaligned accesses and vector instructions for accessing
> UC memory are illegal.
>
> 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86 guest can
> map the memory as pgprot_noncached/ioremap or
> pgprot_writecombine/ioremap_wc.
> If it does the second then it can use unaligned or vector for access.
> This is why normal passthrough tends to work - it never traps to qemu at
> all. But for qemu, vfio uses pgprot_noncached unconditionally so qemu
> can't use unaligned or vector instructions on non-x86.
>
>
> 8. But for nvgrace RAM, vfio has a driver that uses
> pgprot_writecombine/ioremap_wc.
> so qemu could safely use unaligned/vector instructions even on non-x86.
>
> 9. Except sadly, vfio currently does not tell qemu how it maps
> the memory, so qemu can not know what is safe on non-x86.
>
And more:
10. on x86 memcpy will sometimes do multiple overlapping stores when the
size is not a power of 2. For example, a 15 byte write is done with
two 8-byte stores. This is theoretically an issue
if the guest does something super clever with ordering,
but it does not seem to be one in practice.
11. on non-x86 memcpy will do multiple overlapping stores even
for single byte writes, e.g. to avoid extra branches.
This is causing issues in practice.
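
To make the 15 byte case above concrete, here is a minimal sketch of what
such a copy can look like once it is lowered to two 8-byte accesses. The
function name copy15_overlapping() is made up for illustration; it is not
taken from any libc and not from the series. It only mimics the access
pattern: the byte at offset 7 is stored twice.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration only: copy 15 bytes as two overlapping
 * 8-byte chunks, the way an optimized memcpy may do it.
 */
static inline void copy15_overlapping(void *dst, const void *src)
{
    uint64_t lo, hi;

    /* Load bytes 0..7 and bytes 7..14; the two loads overlap at byte 7. */
    memcpy(&lo, src, sizeof(lo));
    memcpy(&hi, (const uint8_t *)src + 7, sizeof(hi));

    /* Store both chunks; byte 7 of dst is written twice. */
    memcpy(dst, &lo, sizeof(lo));
    memcpy((uint8_t *)dst + 7, &hi, sizeof(hi));
}
```

For plain RAM the double store is invisible. For a device that reacts to
each store it is not, which is the concern here.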
> Now, what is to be done?
>
>
> A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> At least for aligned, preferably for unaligned accesses too.
> Fixed width memcpy seems to work for this. Whether we should bother with
> __builtin to work around broken old fortify headers, I dunno.
> I do not have any answer how to check that compiler does this correctly.
> If anyone is motivated enough, adding a GCC builtin could be possible.
> Given qemu did this for years, I think we can leave solving this for
> another day.
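
A rough sketch of the fixed width memcpy idea from A, assuming the size is
dispatched through a switch so that each small copy has a compile-time
constant length. fixed_width_copy() is a hypothetical name, not the
qemu_ram_copy() helper from the series, and plain memcpy is used here where
the series talks about __builtin_memcpy. Nothing in C guarantees that the
compiler emits a single load/store for each case; that is exactly the
unverified assumption mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: keep 1/2/4/8 byte accesses as constant-size
 * copies so the compiler can (but is not required to) emit one
 * load and one store of that width.
 */
static inline void fixed_width_copy(void *dst, const void *src, size_t size)
{
    switch (size) {
    case 1:
        memcpy(dst, src, 1);
        break;
    case 2:
        memcpy(dst, src, 2);
        break;
    case 4:
        memcpy(dst, src, 4);
        break;
    case 8:
        memcpy(dst, src, 8);
        break;
    default:
        /* Other sizes: variable-length copy, access widths unspecified. */
        memcpy(dst, src, size);
        break;
    }
}
```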
>
> B. Also on x86, I do not see why we should not use memcpy for large
> accesses if we can. Better perf.
>
> C. on non-x86, we currently must not memcpy since we do not know if it
> is pgprot_noncached. yes, performance will be bad for DMA into device RAM.
>
> D. It goes without saying that casting an unaligned address to uint32_t *
> (be it for qatomic_set or whatever) is undefined behaviour in C
> and so a bad idea on any architecture.
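
For D, the usual way to stay within defined behaviour is to go through a
local variable with a constant-size memcpy instead of dereferencing a cast
pointer. A minimal sketch, with made-up helper names
load_u32_unaligned()/store_u32_unaligned():

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 32 bits from a possibly unaligned address. */
static inline uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    /* Defined for any alignment, unlike *(const uint32_t *)p. */
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Hypothetical helper: write 32 bits to a possibly unaligned address. */
static inline void store_u32_unaligned(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}
```

This makes the C well defined. It does not by itself say which
instructions the compiler picks for the access.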
>
> E. also for non-x86, we really should teach vfio to tell qemu whether
> it maps device pgprot_noncached or pgprot_writecombine.
> we will then be able to use memcpy for >8 byte accesses.
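
The way I read A through E, the decision could be summarized roughly as
below. Everything here is hypothetical: the enums, the function name, and
above all the idea that qemu knows the mapping attribute, since vfio does
not report it today.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: host architecture class as used in the list above. */
enum host_arch { HOST_X86, HOST_OTHER };

/* Hypothetical: how vfio mapped the region, if it ever tells us. */
enum map_attr { MAP_UNKNOWN, MAP_NONCACHED, MAP_WRITECOMBINE };

/* Sketch: may a variable-length memcpy be used for this access? */
static inline bool may_use_bulk_memcpy(enum host_arch arch,
                                       enum map_attr attr, size_t size)
{
    if (size <= 8) {
        return false;   /* A: small accesses take the fixed width path */
    }
    if (arch == HOST_X86) {
        return true;    /* B: large accesses on x86 */
    }
    /* C and E: elsewhere only when the mapping is known write-combine. */
    return attr == MAP_WRITECOMBINE;
}
```

With MAP_UNKNOWN, which is all we have today, non-x86 always ends up on the
slow path, as C says.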
>
> Anyone, correct me if I'm wrong? Maybe I should start a new thread with
> this summary?
>
> Thanks,
> Gavin