memory: Use __builtin_mem{cpy, move} in accessors of ram device region

Michael S. Tsirkin Sun, 14 Jun 2026 08:14:17 -0700

On Fri, Jun 12, 2026 at 10:25:35AM -0700, Richard Henderson wrote:
> On 6/12/26 09:36, Peter Maydell wrote:
> > On Fri, 12 Jun 2026 at 16:29, Richard Henderson
> > <[email protected]> wrote:
> > > 
> > > On 6/12/26 04:03, Gavin Shan wrote:
> > > > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > > > accessors to ram device memory region, preparatory work to make ram device
> > > > region directly accessible and bypass the bounce buffer in the DMA path
> > > > in next patch.
> > > 
> > > memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the
> > > compiler later decides whether or not to expand inline.
> > 
> > Yes, but if you pass it a fixed small integer, then it is likely
> > to expand it inline, whereas if you pass it a variable then it
> > is likely not to... The patch is attempting to persuade the
> > compiler to definitely do an inline access for 1, 2, 4, 8
> > byte access.
> 
> Sure, for hosts with unaligned accesses.  We still have sparc64 and (some?)
> riscv64 that don't automatically have such and will compile to more than one
> host instruction.
> 
> > > My real question is: what are you attempting to achieve?
> > > 
> > > (1) is the problem unaligned access to a mapped physical device?
> > > (2) is the problem vector access to a mapped physical device?
> > > (3) something else?
> > 
> > I think there are two problems we're trying to fix here:
> > 
> > (1) If a device does e.g. a pci_dma_write() with size 1, we want
> > this to turn into exactly 1 byte write into guest memory, for the
> > normal case where the guest memory is real host RAM.
> > This deals with the e1000 bug where the pci_dma_write() turns into
> > a call to glibc memmove() with size 1 and glibc's implementation
> > turns that into 3 writes of the byte to the same address...
> 
> Gotcha.  Easily handled by not using memcpy/memmove at all.
> 
>       *(char *)ptr = val;
> 
> is sufficient for all hosts.


Yes, I think it does work because we use -fno-strict-aliasing.
For bigger sizes we'll need a packed struct because the addresses
could be unaligned.
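
To make that concrete, here is a rough sketch of what I have in mind. The
helper names are made up for illustration and are not from the tree, and the
packed attribute is the GCC/Clang spelling. The casts are only safe because
we build with -fno-strict-aliasing.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* One-byte store: a plain char-sized store, no library call involved. */
static inline void store_u8(void *ptr, uint8_t val)
{
    *(uint8_t *)ptr = val;
}

/*
 * The packed wrapper has alignment 1, so the compiler may not assume
 * that the field is naturally aligned.
 */
struct unaligned_u32 {
    uint32_t v;
} __attribute__((packed));

/* 4-byte store to a possibly unaligned address (needs -fno-strict-aliasing). */
static inline void store_u32_unaligned(void *ptr, uint32_t val)
{
    ((struct unaligned_u32 *)ptr)->v = val;
}

/* Matching 4-byte load from a possibly unaligned address. */
static inline uint32_t load_u32_unaligned(const void *ptr)
{
    return ((const struct unaligned_u32 *)ptr)->v;
}
```

On hosts without unaligned access the packed version will still compile to
several instructions, so this only addresses the size-1 case for everyone.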

But again, QEMU already relies on this in bswap.h.

I kind of dislike muddying the waters by making several
unrelated changes here. If we do, we should change bswap.h too.
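
For reference, the pattern being relied on is roughly the following. This is
a sketch from memory with hypothetical names, not a copy of bswap.h or of the
patch, and it assumes source and destination do not overlap in the
fixed-size cases.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: with a constant size the compiler is likely to
 * expand the builtin inline instead of calling the libc routine.
 * Assumes dst and src do not overlap for the fixed-size cases.
 */
static inline void copy_small(void *dst, const void *src, size_t size)
{
    switch (size) {
    case 1:
        __builtin_memcpy(dst, src, 1);
        break;
    case 2:
        __builtin_memcpy(dst, src, 2);
        break;
    case 4:
        __builtin_memcpy(dst, src, 4);
        break;
    case 8:
        __builtin_memcpy(dst, src, 8);
        break;
    default:
        /* Variable size: likely a real library call. */
        __builtin_memmove(dst, src, size);
        break;
    }
}

/* Host-endian 32-bit load written in the same fixed-size style. */
static inline uint32_t load_he32(const void *ptr)
{
    uint32_t r;
    __builtin_memcpy(&r, ptr, sizeof(r));
    return r;
}
```

Whether each case ends up as a single host instruction is still up to the
compiler and the host, which is Richard's point about sparc64 and riscv64.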


> > (2) If a vCPU (emulated or KVM) does a 4 byte access to an
> > address which is in the PCI BAR of a vfio-passthrough device,
> > we want this to turn into exactly a 4-byte access to the
> > mmap()ed memory which is the BAR of the host device. This
> > is because that address might be a real hardware device
> > register, so it's important to access it exactly the way
> > the guest vCPU asked for, and not do multiple accesses or
> > misaligned accesses or whatever.
> > 
> > (The reason we make pass-through BARs not "direct access"
> > but instead go via the bounce buffer is that we were working
> > around (2).)
> 
> That's going to require qatomic and alignment checks.
> 
> That's also going to beg the question of the intended behaviour for anything
> that isn't a naturally aligned access of size in {1,2,4,8}: fall back to N
> operations of the minimum of alignment and size?

For most host/guest pairs things simply work even for unaligned accesses.

And yes, guest drivers do do this.

On classical PCI, there are no transactions as such, and
an unaligned access will be split anyway.
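
If we did want an explicit fallback, one reading of "N operations of the
minimum of alignment and size" could look like the sketch below. The
function names are hypothetical, the 8-byte cap is my assumption, and plain
volatile stores stand in for whatever qatomic helper would really be used.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: largest power of two (capped at 8) that divides
 * both the address and the size, i.e. min(alignment, size) for sizes
 * in {1,2,4,8}.
 */
static inline size_t access_chunk(uintptr_t addr, size_t size)
{
    uintptr_t bits = addr | size | 8;
    return bits & -bits;    /* lowest set bit */
}

/*
 * Write len bytes as N equal, naturally aligned stores.
 * Returns N. The wider casts need -fno-strict-aliasing.
 */
static size_t write_split(volatile void *dst, const void *src, size_t len)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;
    size_t n = access_chunk((uintptr_t)d, len);
    size_t ops = 0;

    for (; len > 0; d += n, s += n, len -= n, ops++) {
        uint64_t v64; uint32_t v32; uint16_t v16;

        switch (n) {
        case 8:
            __builtin_memcpy(&v64, s, 8);
            *(volatile uint64_t *)d = v64;
            break;
        case 4:
            __builtin_memcpy(&v32, s, 4);
            *(volatile uint32_t *)d = v32;
            break;
        case 2:
            __builtin_memcpy(&v16, s, 2);
            *(volatile uint16_t *)d = v16;
            break;
        default:
            *d = *s;
            break;
        }
    }
    return ops;
}
```

With this, a 4-byte write at an address that is 2 mod 4 becomes two 2-byte
stores, and the same write at an odd address becomes four byte stores.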



> 
> 
> r~

Re: [PATCH 1/2] system/memory: Use __builtin_mem{cpy, move} in accessors of ram device region
