On Tue, Jun 16, 2026 at 04:09:27AM +1000, Gavin Shan wrote:
> On 6/16/26 3:03 AM, Richard Henderson wrote:
> > On 6/15/26 09:33, Gavin Shan wrote:
> > > > Either assert n != 0 to start, or use while not do/while.
> > > > Because the body of the loop won't handle n == 0 correctly.
> > > >
> > >
> > > I will change this to "while (n > 0)" since we're not expecting
> > > "n < 0" either.
> >
> > size_t is unsigned.
> >
>
> oh, yes.
>
> > > Could you confirm which stores need qatomic_set()? There are two sets
> > > of stores as below. I guess you're asking have qatomic_set() for (a)?
> > > Could you explain a bit why qatomic_set() is needed?
> > >
> > > // (a)
> > > *(uint64_t *)dest = *(uint64_t *)src;
> >
> > Of course (a).
> >
> > Using qatomic_set prevents the compiler from merging the stores, kinda like
> > volatile. Not especially likely with the way you've written this, but if
> > you had exchanged the switch and loop,
> >
> > switch (lsb) {
> > case 1:
> > for (; n ; n--, dst++, src++) {
> > *(uint8_t *)dst = *(uint8_t *)src;
> > }
> >
> > then the compiler is quite likely to "optimize" the loop back to memcpy.
> >
> > It's also self-documenting what we're intending with the store.
> >
>
> ok, thanks for your explanation.
>
> > I guess it's also worth asking if this copy is also used for copying *from*
> > device memory. The commit that added memmove (4a73aee8814) suggests that
> > dma from the device uses these paths. In which case you'll either want a
> > separate function for that, or both source and destination must be aligned.
> >
>
> I think it's a valid case. Lets handle this by limiting the 'step' based on
> src/dest/n, something like below. atomic_read() instead of ldxxx_he_p() is
> used for loads. Please take a look if there are anything we can improve
> further.
>
> // Not compiled and tested yet
> static void qemu_ram_copy_unaligned(void *dest, const void *src, size_t n)
> {
> uintptr_t test, step;
>
> while (n != 0) {
> test = (uintptr_t)src | n; /* alignment enforced by source */
> step = test & -test;
> test = (uintptr_t)dest | n; /* alignment enforced by destination
> */
> step = MIN(step, test & -test);
>
> switch (step) {
> case 1:
> qatomic_set((uint8_t *)dest, qatomic_read((uint8_t *)src));
> src += 1;
> dest += 1;
> n -= 1;
> break;
> case 2:
> qatomic_set((uint16_t *)dest, qatomic_read((uint16_t *)src));
> src += 2;
> dest += 2;
> n -= 2;
> break;
> case 4:
> qatomic_set((uint32_t *)dest, qatomic_read((uint32_t *)src));
> src += 4;
> dest += 4;
> n -= 4;
> break;
> default:
> qatomic_set((uint64_t *)dest, qatomic_read((uint64_t *)src);
> src += 8;
> dest += 8;
> n -= 8;
> }
If the length is > 8, I feel it's important to just use memcpy/memmove
normally.
Because at that point it's a large data transfer not MMIO, and performance
is important.
> }
> }
>
>
> > > Another unrelated question: why 'int' value is returned from ldub_p()
> > > and ld{uw, l}_he_p() in bswap.h? They would return 'uint{8, 16, 32}_t'
> > > values?
> >
> > A bad choice 20 years ago, and the need to audit all uses in order to
> > change it now.
> > Newer functions use uintN_t as you suggest.
> >
>
> Ok :-)
>
> > r~
> >
>
> Thanks,
> Gavin