On Wed, Jun 17, 2026 at 12:35:00PM +1000, Gavin Shan wrote:
> On 6/16/26 3:44 PM, Michael S. Tsirkin wrote:
> > On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
> > > On 6/16/26 3:25 PM, Gavin Shan wrote:
> > > > All ram device regions were made indirectly accessible by commit
> > > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This
> > > > leads to a hung guest when an NVidia GH100 GPU is passed through from
> > > > the host. The memory in its PCI BAR#4 can be allocated as a DMA target
> > > > buffer. qemu has to use the DMA bounce buffer in address_space_map() to
> > > > cover the DMA request. However, the bounce buffer size is 4096 bytes
> > > > and we easily overrun it when the guest has significant disk activity
> > > > while compiling 'cuda-samples'. The full log and problem description
> > > > can be found in PATCH[1/2]'s commit log.
> > > >
> > > > Try to fix the issue handled in commit 4a2e242bbb by replacing
> > > > memcpy()/memmove() with the newly added helpers qemu_ram_{copy, move}(),
> > > > which work on top of __builtin_{memcpy, memmove} or unaligned access
> > > > friendly memory movement in the accessors to the ram device regions.
> > > > With this, we can basically revert that commit to make ram device
> > > > regions directly accessible again and bypass the bounce buffer in
> > > > address_space_map(), where the guest hang is caused.
> > > >
> > > > PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> > > > PATCH[2] makes ram device region directly accessible again
> > > >
> > > Michael asked to include the context below in the cover letter in v3,
> > > but I didn't notice that before I sent the v3 series, so it is appended
> > > here.
> > >
>
> Looking at the list of issues (questions) raised by Michael, I don't
> understand every one
Gavin, I doubt one should make memory.c changes without understanding the issues
it is trying to address.
What is unclear? Ask away.
> before I'm able to put more time into digging, but I feel this series has
> too ambitious a goal: to cover accesses to all the directly accessible
> regions with the newly introduced qemu_ram_{copy, move}. It causes too many
> behavior changes and concerns, making this series impossible to land.
>
> I would suggest breaking down the goal and stepping back to apply the newly
> introduced qemu_ram_{copy, move} to the ram device regions only. It's
> actually something proposed by Peter Xu in the earlier replies. Taking
> address_space_write() as an example, the indirectly accessible regions are
> covered by memory_region_dispatch_write() in (1), the ram device region is
> covered by qemu_ram_move() in (2), and all other directly accessible regions
> are covered by memmove() in (3).
>
> address_space_write
>   flatview_write
>     flatview_write_continue
>       flatview_write_continue_step
>         memory_access_size             // (1) indirectly accessible region
>         memory_region_dispatch_write
>           access_with_adjusted_size
>             memory_region_write_accessor
>               mr->ops->write
>         qemu_ram_move                  // (2) ram device region
>         memmove                        // (3) all other directly accessible regions
>
> With this limitation, only the ram device regions in (2) are affected. We're
> basically moving the accesses to the ram device region from (1) to (2). No
> changes are introduced to other types of regions. The goal is to make the
> ram device region directly accessible so that the bounce buffer can be
> bypassed in the DMA path.
Aesthetics aside, ram device regions have all the same issues.
Maybe you can limit the scope of the changes,
but I doubt you can get out of understanding them.
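
For reference, the three-way split proposed above can be modelled as a small
selection function. This is only a sketch: `region_model`, `write_path` and
`pick_write_path()` are hypothetical names made up for illustration and are
not QEMU code; the real decision is made inside flatview_write_continue_step().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three paths (1), (2) and (3) quoted above. */
enum write_path {
    PATH_DISPATCH = 1,        /* (1) memory_region_dispatch_write() */
    PATH_RAM_DEVICE_MOVE = 2, /* (2) qemu_ram_move() */
    PATH_MEMMOVE = 3,         /* (3) plain memmove() */
};

/* Hypothetical stand-in for the two properties of a region that matter here. */
struct region_model {
    bool direct;     /* region is directly accessible */
    bool ram_device; /* region is a ram device region */
};

static enum write_path pick_write_path(const struct region_model *mr)
{
    assert(mr != NULL);

    if (!mr->direct) {
        return PATH_DISPATCH;
    }
    if (mr->ram_device) {
        return PATH_RAM_DEVICE_MOVE;
    }
    return PATH_MEMMOVE;
}
```

Note that path (2) still has to respect the access-size and alignment
constraints listed below; the split only limits which regions are affected.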
> > > ----
> > >
> > > The issues listed by Michael:
> > >
> > > 1. On x86, memcpy is different from __builtin_memcpy if one uses old 1.0
> > > fortify-headers from 2019. Likely no longer relevant.
> > >
> > > 2. variable length memcpy can translate 2,4,8 byte guest access into
> > > multiple byte accesses. doing this for mmio is guaranteed to break
> > > devices.
> > >
> > > 3. (theoretical concern) also on x86, unaligned accesses are possible on
> > > guest and host, so converting an unaligned access to a series of aligned
> > > ones can in theory break devices.
> > >
> > > 4. also on x86, vector instructions for large (>16 byte) writes into
> > > pgprot_noncached memory are safe and faster than multiple 8 byte ones.
> > >
> > > 5. also on x86 it so happens that if you write a fixed-size memcpy this
> > > gets optimized to a single store/load and it works for aligned and
> > > unaligned addresses on that architecture. How to ensure this keeps being
> > > correct is left as an exercise for the reader. But qemu already relies on
> > > this and did for years.
> > >
> > > 6. on non-x86 both unaligned accesses and vector instructions for
> > > accessing UC memory are illegal.
> > >
> > > 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86
> > > the guest can map the memory as pgprot_noncached/ioremap or
> > > pgprot_writecombine/ioremap_uc.
> > > If it does the second then it can use unaligned or vector for access.
> > > This is why normal passthrough tends to work - it never traps to qemu
> > > at all. But for qemu, vfio uses pgprot_noncached unconditionally so qemu
> > > can't use unaligned or vector instructions on non-x86.
> > >
> > >
> > > 8. But for nvgrace RAM, vfio has a driver that uses
> > > pgprot_writecombine/ioremap_uc.
> > > so qemu could safely use unaligned/vector instructions even on
> > > non-x86.
> > >
> > > 9. Except sadly, vfio currently does not tell qemu how it maps
> > > the memory, so qemu can not know what is safe on non-x86.
> > >
> >
> > And more:
> >
> > 10. on x86 memcpy will sometimes do multiple overlapping stores when
> > size is not a power of 2. for example, a 15 byte write is done with
> > 2 8-byte stores. This is theoretically an issue
> > if guest does something super clever with ordering,
> > but does not seem to be in practice.
> >
> > 11. on non-x86 memcpy will do multiple overlapping stores even
> > for single byte writes. E.g. it does it to avoid extra branches.
> > This is causing issues in practice.
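
To make items 2, 5 and 10 concrete, here is a minimal sketch of the
fixed-width approach. `copy_fixed_width()` is a hypothetical helper written
for this mail, not an existing QEMU function. It assumes the compiler turns a
memcpy with a constant length of 1, 2, 4 or 8 into a single load and a single
store, which is what x86 compilers do in practice but which the C standard
does not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy a 1, 2, 4 or 8 byte access using a memcpy whose
 * length is a compile-time constant in every branch, so the compiler is free
 * to emit one load and one store instead of a byte loop or overlapping
 * stores.
 *
 * Returns 0 on success and -1 for any other length; the caller then has to
 * split the access itself instead of handing a variable length to memcpy.
 */
static int copy_fixed_width(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    switch (len) {
    case 1:
        memcpy(dst, src, 1);
        return 0;
    case 2:
        memcpy(dst, src, 2);
        return 0;
    case 4:
        memcpy(dst, src, 4);
        return 0;
    case 8:
        memcpy(dst, src, 8);
        return 0;
    default:
        return -1; /* not a single-access size */
    }
}
```

Rejecting other lengths instead of falling back to a variable-length memcpy
is deliberate: a 15 byte request would otherwise be exactly the overlapping
store case from item 10.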
> >
> >
> >
> >
> >
> > > Now, what is to be done?
> > >
> > >
> > > A. on x86, we must avoid converting 2,4,8 byte accesses into byte
> > > accesses.
> > > > At least for aligned, preferably for unaligned accesses too.
> > > Fixed width memcpy seems to work for this. Whether we should bother with
> > > > __builtin to work around broken old fortify headers, I don't know.
> > > I do not have any answer how to check that compiler does this correctly.
> > > If anyone is motivated enough, adding a GCC builtin could be possible.
> > > Given qemu did this for years, I think we can leave solving this for
> > > another day.
> > >
> > > B. Also on x86, I do not see why we should not use memcpy for large
> > > accesses if we can. Better perf.
> > >
> > > C. on non-x86, we currently must not memcpy since we do not know if it
> > > is pgprot_noncached. yes, performance will be bad for DMA into device RAM.
> > >
> > > > D. It goes without saying that casting an unaligned address to uint32_t
> > > (be it for qatomic_set or whatever) is undefined behaviour in C
> > > and so a bad idea on any architecture.
> > >
> > > E. also for non-x86, we really should teach vfio to tell qemu whether
> > > it maps device pgprot_noncached or pgprot_writecombine.
> > > we will then be able to use memcpy for >8 accesses.
> > >
> > > Anyone, correct me if I'm wrong? Maybe I should start a new thread with
> > > this summary?
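
As an illustration of point D, a sketch of how an unaligned 32-bit access can
be expressed without casting the pointer. `load_u32_unaligned()` and
`store_u32_unaligned()` are hypothetical names used only for this example.
Going through a local variable and a fixed-size memcpy keeps the code defined
in C on every architecture; whether the resulting machine access is legal for
a given mapping is a separate question (see items 6 to 9).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers: read or write 32 bits at an address that may not be
 * 4-byte aligned. Writing *(uint32_t *)p here would be undefined behaviour
 * for an unaligned p; a fixed-size memcpy through a local is well defined.
 */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}

static void store_u32_unaligned(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof(v));
}
```

Usage is the same as a plain load or store, for example
`store_u32_unaligned(buf + 1, val)` followed by `load_u32_unaligned(buf + 1)`.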
> > >
>
> Thanks,
> Gavin