On Fri, 12 Jun 2026 at 16:29, Richard Henderson <[email protected]> wrote:
>
> On 6/12/26 04:03, Gavin Shan wrote:
> > This replaces mem{cpy, move} with __builtin_mem{cpy, move} in the memory
> > accessors to ram device memory region, preparatory work to make ram device
> > region directly accessible and bypass the bounce buffer in the DMA path
> > in next patch.
>
> memcpy/memmove *always* compile to __builtin_memcpy/memmove, and the compiler
> later decides whether or not to expand inline.

Yes, but if you pass it a fixed small integer, then it is likely
to expand it inline, whereas if you pass it a variable then it
is likely not to... The patch is attempting to persuade the
compiler to definitely do an inline access for 1, 2, 4, 8 byte
access.

> So, this doesn't do what you think it does.
> My real question is: what are you attempting to achieve?
>
> (1) is the problem unaligned access to a mapped physical device?
> (2) is the problem vector access to a mapped physical device?
> (3) something else?

I think there are two problems we're trying to fix here:

(1) If a device does e.g. a pci_dma_write() with size 1, we want
this to turn into exactly 1 byte write into guest memory, for
the normal case where the guest memory is real host RAM.
This deals with the e1000 bug where the pci_dma_write() turns
into a call to glibc memmove() with size 1 and glibc's
implementation turns that into 3 writes of the byte to the
same address, and then a guest write/read might interleave
badly with the extra writes.[*] Similarly some devices specify
that they do definitely do DMA accesses in a particular way
(e.g. the Arm SMMU spec says that the SMMU reads 64-bit page
table entries as single-copy atomic accesses).

(2) If a vCPU (emulated or KVM) does a 4 byte access to an
address which is in the PCI BAR of a vfio-passthrough device,
we want this to turn into exactly a 4-byte access to the
mmap()ed memory which is the BAR of the host device. This is
because that address might be a real hardware device register,
so it's important to access it exactly the way the guest vCPU
asked for, and not do multiple accesses or misaligned accesses
or whatever.

(The reason we make pass-through BARs not "direct access" but
instead go via the bounce buffer is that we were working around
(2).)

[*] I have not fully investigated the e1000 situation so it's
possible that there is some other issue also there, e.g. guest
or device model getting barrier semantics wrong.
But writing the byte with the "done with this descriptor" bit
3 times rather than once seems like asking for trouble.

thanks
-- PMM
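
A minimal sketch of the "exactly one access of the requested size"
idea in (1) and (2) above. copy_exact() is a hypothetical name, not
an existing QEMU function. It assumes naturally aligned pointers and
a build with -fno-strict-aliasing (or memory with no conflicting
declared type). In practice compilers emit one load and one store for
an aligned volatile access of a native size, though the C standard
does not promise that. Everything else falls back to memmove().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy len bytes from src to dst, using a single
 * load and a single store when len is 1, 2, 4 or 8 and both pointers
 * are naturally aligned for that size. Assumes -fno-strict-aliasing.
 */
static inline void copy_exact(void *dst, const void *src, size_t len)
{
    /* OR the addresses together so one mask test covers both pointers. */
    uintptr_t addr_bits = (uintptr_t)dst | (uintptr_t)src;

    assert(dst != NULL && src != NULL);

    if (len == 1) {
        *(volatile uint8_t *)dst = *(const volatile uint8_t *)src;
    } else if (len == 2 && (addr_bits & 1) == 0) {
        *(volatile uint16_t *)dst = *(const volatile uint16_t *)src;
    } else if (len == 4 && (addr_bits & 3) == 0) {
        *(volatile uint32_t *)dst = *(const volatile uint32_t *)src;
    } else if (len == 8 && (addr_bits & 7) == 0) {
        *(volatile uint64_t *)dst = *(const volatile uint64_t *)src;
    } else {
        /* Odd sizes or misaligned pointers: no single-access guarantee. */
        memmove(dst, src, len);
    }
}
```

Whether an 8-byte access is single-copy atomic still depends on the
host architecture, so this only illustrates the intent.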
