On Tue, Jun 16, 2026 at 03:07:27PM +1000, Gavin Shan wrote:
> On 6/16/26 2:59 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 15, 2026 at 09:48:00PM -0700, Richard Henderson wrote:
> > > On 6/15/26 21:23, Michael S. Tsirkin wrote:
> > > > B. Also on x86, I do not see why we should not use memcpy for large
> > > > accesses if we can. Better perf.
> > > 
> > > We have an example where memcpy writes to the same location 3 times.
> > > This is not appropriate for any host.
> > > 
> > > 
> > > r~
> > 
> > Ah, checked libc and sure enough, it does it. E.g. it uses 2 overlapping SSE
> > stores to do a 17 byte write. Not sure how we get 3 but whatevs.
> > 
> > 
> > But just to clarify, I am talking about DMA accesses, that are not
> > initiated by the VCPU.  I am not so sure we care about multiple stores
> > in this instance? Do we? We do care about speed, for sure.
> > 
> 
> In current implementation, qemu_ram_copy/move are differentiated on x86
> and other architectures. Do we need to unify the implementations
> (qemu_ram_copy/move) on all architectures to avoid using memcpy() and
> memmove()?

I am not sure the issues are more than theoretical for anything
outside 1, 2, 4 and 8 byte accesses.

I'd be care

> Maybe it's time for me to post (v3) for a new round of discussions.
> 
> Thanks,
> Gavin

Maybe start a top post with the list of issues and solutions first of
all.

Let's add more to the list:


10. On x86, memcpy will sometimes do multiple overlapping stores when
the size is not a power of 2. For example, a 15-byte write is done with
two 8-byte stores. This is theoretically an issue if the guest does
something super clever with ordering, but it does not seem to be one
in practice.

11. On non-x86, memcpy will do multiple overlapping stores even
for single-byte writes, e.g. to avoid extra branches.
This is causing issues in practice.
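
To make the x86 case above concrete, here is a minimal sketch of the
overlapping-store pattern. It is only an illustration: copy_8_to_16() and
overlap_bytes() are hypothetical helpers, not code from any libc or from
QEMU, and whether each fixed-size memcpy() below becomes a single 8-byte
move depends on the compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (assumption, for illustration only): copy n bytes,
 * 8 <= n <= 16, using exactly two 8-byte loads and two 8-byte stores.
 * For n < 16 the two stores overlap, so bytes [n - 8, 8) of the
 * destination are written twice.
 */
static void copy_8_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;

    assert(n >= 8 && n <= 16);

    /* Load both chunks before storing anything. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const char *)src + n - 8, 8);

    memcpy(dst, &head, 8);                  /* bytes [0, 8) */
    memcpy((char *)dst + n - 8, &tail, 8);  /* bytes [n - 8, n) */
}

/* Number of destination bytes that copy_8_to_16() writes twice. */
static size_t overlap_bytes(size_t n)
{
    assert(n >= 8 && n <= 16);
    return 16 - n;
}
```

For n = 15 the stores cover bytes [0, 8) and [7, 15), so byte 7 is
written twice. The final contents are correct either way; the only
thing an observer could notice is the extra store to that byte.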


-- 
MST

