On Tue, Jun 16, 2026 at 03:07:27PM +1000, Gavin Shan wrote:
> On 6/16/26 2:59 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 15, 2026 at 09:48:00PM -0700, Richard Henderson wrote:
> > > On 6/15/26 21:23, Michael S. Tsirkin wrote:
> > > > B. Also on x86, I do not see why we should not use memcpy for large
> > > > accesses if we can. Better perf.
> > >
> > > We have an example where memcpy writes to the same location 3 times.
> > > This is not appropriate for any host.
> > >
> > >
> > > r~
> >
> > Ah, checked libc and sure enough, it does it. E.g. it uses 2 overlapping SSE
> > stores to do a 17 byte write. Not sure how we get 3 but whatevs.
> >
> >
> > But just to clarify, I am talking about DMA accesses, that are not
> > initiated by the VCPU. I am not so sure we care about multiple stores
> > in this instance? Do we? We do care about speed, for sure.
> >
>
> In current implementation, qemu_ram_copy/move are differentiated on x86
> and other architectures. Do we need to unify the implementations
> (qemu_ram_copy/move) on all architectures to avoid using memcpy() and
> memmove()?

For anything outside 1, 2, 4, 8 bytes, I am not sure the issues are more
than theoretical. I'd be care

> Maybe it's time for me to post (v3) for a new round of discussions.
>
> Thanks,
> Gavin

Maybe start a top post with the list of issues and solutions first of all.

Let's add more to the list:

10. On x86, memcpy will sometimes do multiple overlapping stores
    when the size is not a power of 2. For example, a 15 byte write
    is done with 2 8-byte stores. This is theoretically an issue
    if the guest does something super clever with ordering, but it
    does not seem to be one in practice.

11. On non-x86, memcpy will do multiple overlapping stores
    even for single byte writes. E.g. it does this to avoid
    extra branches. This is causing issues in practice.

-- 
MST
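
To make the x86 case in the list concrete (a 15 byte write done with 2
8-byte stores), here is a minimal sketch of the pattern. It is an
illustration only, not glibc's or QEMU's actual code: the helper names
copy_8_to_16_overlapping() and overlap_bytes() are made up for this
example, and it assumes the compiler lowers each fixed-size 8-byte
memcpy() to a single 8-byte move, which optimizing compilers typically
do but are not required to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a QEMU or libc function): copy 8..16 bytes
 * using exactly two 8-byte accesses. The second access is anchored at
 * the end of the buffer, so for any len < 16 the two stores overlap
 * and the shared bytes are written twice.
 */
static void copy_8_to_16_overlapping(void *dst, const void *src, size_t len)
{
    uint64_t head, tail;

    assert(len >= 8 && len <= 16);

    /* Load both pieces before storing anything. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const uint8_t *)src + len - 8, 8);

    /* Store 1: bytes [0, 8). */
    memcpy(dst, &head, 8);
    /* Store 2: bytes [len - 8, len); overlaps store 1 when len < 16. */
    memcpy((uint8_t *)dst + len - 8, &tail, 8);
}

/* Hypothetical helper: number of bytes written by both stores. */
static size_t overlap_bytes(size_t len)
{
    assert(len >= 8 && len <= 16);
    return 16 - len;
}
```

For len = 15 the stores cover bytes [0, 8) and [7, 15), so byte 7 is
written twice. The final contents are still correct; the concern is
only an observer that cares about how many times, or in what order,
a location is stored to.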
