memory: Use __builtin_mem{cpy, move} in accessors of ram device region

Peter Maydell Fri, 12 Jun 2026 10:58:58 -0700

On Fri, 12 Jun 2026 at 18:25, Richard Henderson
<[email protected]> wrote:
>
> On 6/12/26 09:36, Peter Maydell wrote:
> > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > <[email protected]> wrote:
> >>
> >> On 6/12/26 04:03, Gavin Shan wrote:
> >>> This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> >>> accessors to ram device memory region, preparatory work to make ram device
> >>> region directly accessible and bypass the bounce buffer in the DMA path
> >>> in next patch.
> >>
> >> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the 
> >> compiler later
> >> decides whether or not to expand inline.
> >
> > Yes, but if you pass it a fixed small integer, then it is likely
> > to expand it inline, whereas if you pass it a variable then it
> > is likely not to... The patch is attempting to persuade the
> > compiler to definitely do an inline access for 1, 2, 4, 8
> > byte access.
>
> Sure, for hosts with unaligned accesses.  We still have sparc64 and (some?) 
> riscv64 that
> don't automatically have such and will compile to more than one host 
> instruction.


Good point.

> >> My real question is: what are you attempting to achieve?
> >>
> >> (1) is the problem unaligned access to a mapped physical device?
> >> (2) is the problem vector access to a mapped physical device?
> >> (3) something else?
> >
> > I think there are two problems we're trying to fix here:
> >
> > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > this to turn into exactly 1 byte write into guest memory, for the
> > normal case where the guest memory is real host RAM.
> > This deals with the e1000 bug where the pci_dma_write() turns into
> > a call to glibc memmove() with size 1 and glibc's implementation
> > turns that into 3 writes of the byte to the same address...
>
> Gotcha.  Easily handled by not using memcpy/memmove at all.
>
>         *(char *)ptr = val;
>
> is sufficient for all hosts.

That helps e1000 because it happens to be doing byte writes,
but the same effect probably occurs for word writes in other
devices...

> > (2) If a vCPU (emulated or KVM) does a 4 byte access to an
> > address which is in the PCI BAR of a vfio-passthrough device,
> > we want this to turn into exactly a 4-byte access to the
> > mmap()ed memory which is the BAR of the host device. This
> > is because that address might be a real hardware device
> > register, so it's important to access it exactly the way
> > the guest vCPU asked for, and not do multiple accesses or
> > misaligned accesses or whatever.
> >
> > (The reason we make pass-through BARs not "direct access"
> > but instead go via the bounce buffer is that we were working
> > around (2).)
>
> That's going to require qatomic and alignment checks.
>
> That's also going to beg the question of the intended behaviour
> anything that isn't a naturally aligned access of size in {1,2,4,8}:
>fall back to N operations of the minimum of alignment and size?

I think in practice we can probably say we don't care and
fall back to memmove: anything unaligned here is almost
certainly going to real RAM and not going to have any
particular ordering requirements (and/or might be caused
by the guest misprogramming a device DMA address).

I feel like we handle "unaligned access, maybe need to split it"
already in multiple parts of QEMU (for TCG accesses we do that
in accel/tcg; system/memory.c has a TODO comment about unaligned
accesses at the IO MemoryRegion read/writefn level and there's
a patchseries that tries to fix that), so if we find ourselves
really needing to do "handle this unaligned access by splitting
it into pieces" again here we ought to take a wider look at
whether we can consolidate that at all...

thanks
-- PMM

Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region

Reply via email to