memory: Make ram device region directly accessible

Michael S. Tsirkin Tue, 16 Jun 2026 22:53:20 -0700

On Wed, Jun 17, 2026 at 12:35:00PM +1000, Gavin Shan wrote:
> On 6/16/26 3:44 PM, Michael S. Tsirkin wrote:
> > On Tue, Jun 16, 2026 at 03:40:34PM +1000, Gavin Shan wrote:
> > > On 6/16/26 3:25 PM, Gavin Shan wrote:
> > > > All ram device regions were made indirectly accessible by commit
> > > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions"). This
> > > > leads to a hung guest when an NVidia GH100 GPU is passed through from
> > > > the host. The memory in its PCI BAR#4 can be allocated as a DMA target
> > > > buffer. qemu has to take the DMA bounce buffer in address_space_map()
> > > > to cover the DMA request. However, the bounce buffer size is 4096
> > > > bytes and we're overrunning it easily when the guest has significant
> > > > disk activity while compiling 'cuda-samples'. The full log and problem
> > > > description can be found in PATCH[1/2]'s commit log.
> > > > 
> > > > Try to fix the issue handled in commit 4a2e242bbb by replacing
> > > > memcpy()/memmove() with newly added helpers qemu_ram_{copy, move}()
> > > > that work on top of __builtin_{memcpy, memmove} or unaligned access
> > > > friendly memory movement in the accessors to the ram device regions.
> > > > With this, we can basically revert that commit to make the ram device
> > > > region directly accessible again and bypass the bounce buffer in
> > > > address_space_map(), where the guest hang is caused.
> > > > 
> > > > PATCH[1] uses qemu_ram_{copy, move}() in ram device region accessors
> > > > PATCH[2] makes ram device region directly accessible again
> > > > 
> > > Michael asked to include the context below in the v3 cover letter, but I
> > > didn't notice that before I sent the v3 series, so it is appended here.
> > > 
> 
> Looking at the list of issues (questions) raised by Michael, I don't
> understand every one


Gavin, I doubt one should make memory.c changes without understanding the issues
it is trying to address.

What is unclear? Ask away.


> before I'm able to put more time into digging, but I feel this series has
> too ambitious a goal: covering accesses to all the directly accessible
> regions with the newly introduced qemu_ram_{copy, move}. It causes too many
> behavior changes and concerns, making this series impossible to land.
> 
> I would suggest breaking down the goal and stepping back to apply the newly
> introduced qemu_ram_{copy, move} to the ram device regions only. It's
> actually something proposed by Peter Xu in the earlier replies. Taking
> address_space_write() as an example, the indirectly accessible regions are
> covered by memory_region_dispatch_write() in (1), the ram device region is
> covered by qemu_ram_move() in (2), and all other directly accessible regions
> are covered by memmove() in (3).
> 
>   address_space_write
>     flatview_write
>       flatview_write_continue
>         flatview_write_continue_step
>           memory_access_size             // (1) indirectly accessible region
>           memory_region_dispatch_write
>             access_with_adjusted_size
>               memory_region_write_accessor
>                 mr->ops->write
>           qemu_ram_move                  // (2) ram device region
>           memmove                        // (3) all other directly accessible regions
> 
> With this limitation, only the ram device regions in (2) are affected. We're
> basically moving the accesses to the ram device region from (1) to (2). No
> changes are introduced to other types of regions. The goal is to make the ram
> device region directly accessible so that the bounce buffer can be bypassed
> in the DMA path.

Esthetics aside - ram device regions have all the same issues.

Maybe you can limit the scope of the changes,
but I doubt you can get out of understanding them :)
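
For readers following the thread, the three-way routing in the quoted call
tree can be sketched roughly as below. This is only an illustration under
stated assumptions, not QEMU code: the sketch_* names, the region struct and
the path enum are all hypothetical, and every path moves data with a plain
memmove() because only the selection between (1), (2) and (3) is shown, not
the access-width behavior that the numbered issues are about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical region kinds; QEMU's real MemoryRegion is far richer. */
enum sketch_kind { SKETCH_INDIRECT, SKETCH_RAM_DEVICE, SKETCH_RAM };

/* Which path handled a write, numbered like (1), (2), (3) in the call tree. */
enum sketch_path {
    PATH_NONE = 0,
    PATH_DISPATCH = 1,
    PATH_RAM_DEVICE = 2,
    PATH_MEMMOVE = 3
};

struct sketch_region {
    enum sketch_kind kind;
    uint8_t *host;      /* backing storage for the region */
    size_t size;        /* region length in bytes */
};

/*
 * Route one write to exactly one of the three paths and report which one
 * was chosen. Out-of-range writes are rejected and touch nothing.
 */
enum sketch_path sketch_write(struct sketch_region *r, size_t off,
                              const void *buf, size_t len)
{
    if (off > r->size || len > r->size - off) {
        return PATH_NONE;
    }

    /* Placeholder data movement, identical for all paths in this sketch. */
    memmove(r->host + off, buf, len);

    switch (r->kind) {
    case SKETCH_INDIRECT:
        return PATH_DISPATCH;       /* (1) would go through a write callback */
    case SKETCH_RAM_DEVICE:
        return PATH_RAM_DEVICE;     /* (2) would use the proposed helper */
    default:
        return PATH_MEMMOVE;        /* (3) ordinary directly accessible RAM */
    }
}
```

With this shape, limiting the scope means only the SKETCH_RAM_DEVICE case
changes behavior; the questions about which access widths are safe for that
case remain the same.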


> > > ----
> > > 
> > > The issues listed by Michael:
> > > 
> > > 1. On x86, memcpy is different from __builtin_memcpy if one uses old 1.0
> > >     fortify-headers from 2019. Likely no longer relevant.
> > > 
> > > 2. variable length memcpy can translate 2,4,8 byte guest access into
> > >     multiple byte accesses. doing this for mmio is guaranteed to break
> > >     devices.
> > > 
> > > 3. (theoretical concern) also on x86, unaligned accesses are possible on
> > >     guest and host, so converting an unaligned access to a series of
> > >     aligned ones can in theory break devices.
> > > 
> > > 4. also on x86, vector instructions for large (>16 byte) writes into
> > >     pgprot_noncached memory are safe and faster than multiple 8 byte ones.
> > > 
> > > 5. also on x86 it so happens that if you write a fixed-size memcpy this
> > >     gets optimized to a single store/load and it works for aligned and
> > >     unaligned addresses on that architecture. How to ensure this keeps
> > >     being correct is left as an exercise for the reader. But qemu
> > >     already relies on this and did for years.
> > > 
> > > 6. on non-x86 both unaligned accesses and vector instructions for
> > >     accessing UC memory are illegal.
> > > 
> > > 7. standard vfio gives KVM VM_ALLOW_ANY_UNCACHED, so even on non x86
> > >     guest can map the memory as pgprot_noncached/ioremap or
> > >     pgprot_writecombine/ioremap_uc.
> > >     If it does the second then it can use unaligned or vector for access.
> > >     This is why normal passthrough tends to work - it never traps to qemu
> > >     at all. But for qemu, vfio uses pgprot_noncached unconditionally so
> > >     qemu can't use unaligned or vector instructions on non-x86.
> > > 
> > > 
> > > 8. But for nvgrace RAM, vfio has a driver that uses
> > >     pgprot_writecombine/ioremap_uc.
> > >     so qemu could safely use unaligned/vector instructions even on
> > >     non-x86.
> > > 
> > > 9. Except sadly, vfio currently does not tell qemu how it maps
> > >     the memory, so qemu can not know what is safe on non-x86.
> > > 
> > 
> > And more:
> > 
> > 10. on x86 memcpy will sometimes do multiple overlapping stores when
> > size is not a power of 2. for example, a 15 byte write is done with
> > 2 8-byte stores. This is theoretically an issue
> > if guest does something super clever with ordering,
> > but does not seem to be in practice.
> > 
> > 11. on non-x86 memcpy will do multiple overlapping stores even
> > for single byte writes. E.g. it does it to avoid extra branches.
> > This is causing issues in practice.
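
The 15-byte case in the first item 10 can be made concrete with a small
sketch. This is not the actual libc memcpy; copy15_overlapping and the writes
counter are hypothetical names used only to show how two 8-byte moves at
offsets 0 and 7 cover 15 bytes while touching byte 7 twice.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy exactly 15 bytes using two 8-byte moves at offsets 0 and 7.
 * writes[i] is incremented once for each store that touches dst[i],
 * so the caller can see which bytes were written more than once.
 */
void copy15_overlapping(uint8_t *dst, const uint8_t *src, unsigned writes[15])
{
    uint64_t lo, hi;

    memcpy(&lo, src, 8);        /* source bytes 0..7 */
    memcpy(&hi, src + 7, 8);    /* source bytes 7..14 */

    memcpy(dst, &lo, 8);        /* first store: dst[0..7] */
    for (int i = 0; i < 8; i++) {
        writes[i]++;
    }

    memcpy(dst + 7, &hi, 8);    /* second store: dst[7..14] */
    for (int i = 7; i < 15; i++) {
        writes[i]++;
    }
}
```

On ordinary RAM the double write to byte 7 is invisible because both stores
carry the same value; a device that reacts to each store could observe it.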
> > 
> > 
> > 
> > 
> > 
> > > Now, what is to be done?
> > > 
> > > 
> > > A. on x86, we must avoid converting 2,4,8 byte accesses into byte accesses.
> > > At least for aligned, preferably for unaligned accesses too.
> > > Fixed width memcpy seems to work for this. Whether we should bother with
> > > __builtin to work around broken old fortify headers, I dunno.
> > > I do not have any answer how to check that compiler does this correctly.
> > > If anyone is motivated enough, adding a GCC builtin could be possible.
> > > Given qemu did this for years, I think we can leave solving this for
> > > another day.
> > > 
> > > B. Also on x86, I do not see why we should not use memcpy for large
> > > accesses if we can. Better perf.
> > > 
> > > C. on non-x86, we currently must not memcpy since we do not know if it
> > > is pgprot_noncached. yes, performance will be bad for DMA into device RAM.
> > > 
> > > D. It goes without saying that casting an unaligned address to uint32_t
> > > (be it for qatomic_set or whatever) is undefined behaviour in C
> > > and so a bad idea on any architecture.
> > > 
> > > E. also for non-x86, we really should teach vfio to tell qemu whether
> > > it maps device pgprot_noncached or pgprot_writecombine.
> > > we will then be able to use memcpy for >8 accesses.
> > > 
> > > Anyone, correct me if I'm wrong? Maybe I should start a new thread with
> > > this summary?
> > > 
> 
> Thanks,
> Gavin
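
As an illustration of points A and D in the quoted summary, a fixed-width
access can be written with a constant-size memcpy instead of casting a
possibly unaligned pointer. This is a minimal sketch with hypothetical helper
names (load_u32, store_u32, copy_fixed), not the helpers from the series. As
the thread notes, C does not guarantee that the compiler turns each
constant-size memcpy into a single load or store; the sketch only shows the
source-level pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from any address without an unaligned pointer cast. */
uint32_t load_u32(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}

/* Write a 32-bit value to any address without an unaligned pointer cast. */
void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}

/*
 * Copy 1, 2, 4 or 8 bytes with a compile-time-constant size in each case.
 * Other sizes are refused so the caller has to split them explicitly
 * rather than fall back to a variable-length memcpy.
 */
bool copy_fixed(void *dst, const void *src, size_t size)
{
    switch (size) {
    case 1: memcpy(dst, src, 1); return true;
    case 2: memcpy(dst, src, 2); return true;
    case 4: memcpy(dst, src, 4); return true;
    case 8: memcpy(dst, src, 8); return true;
    default: return false;
    }
}
```

A caller would pick the size from the guest access width and handle a false
return by splitting the access itself, which keeps that policy decision
visible instead of hidden inside memcpy.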

Re: [PATCH v3 0/2] system/memory: Make ram device region directly accessible
