On Tue, Jun 16, 2026 at 04:09:27AM +1000, Gavin Shan wrote:
> On 6/16/26 3:03 AM, Richard Henderson wrote:
> > On 6/15/26 09:33, Gavin Shan wrote:
> > > > Either assert n != 0 to start, or use while not do/while.
> > > > Because the body of the loop won't handle n == 0 correctly.
> > > > 
> > > 
> > > I will change this to "while (n > 0)" since we're not expecting
> > > "n < 0" either.
> > 
> > size_t is unsigned.
> > 
> 
> oh, yes.
> 
> > > Could you confirm which stores need qatomic_set()? There are two sets
> > > of stores as below. I guess you're asking have qatomic_set() for (a)?
> > > Could you explain a bit why qatomic_set() is needed?
> > > 
> > >      // (a)
> > >      *(uint64_t *)dest = *(uint64_t *)src;
> > 
> > Of course (a).
> > 
> > Using qatomic_set prevents the compiler from merging the stores, kinda like 
> > volatile.  Not especially likely with the way you've written this, but if 
> > you had exchanged the switch and loop,
> > 
> >      switch (lsb) {
> >      case 1:
> >          for (; n ; n--, dst++, src++) {
> >              *(uint8_t *)dst = *(uint8_t *)src;
> >          }
> > 
> > then the compiler is quite likely to "optimize" the loop back to memcpy.
> > 
> > It's also self-documenting what we're intending with the store.
> > 
> 
> ok, thanks for your explanation.
> 
> > I guess it's also worth asking if this copy is also used for copying *from* 
> > device memory.  The commit that added memmove (4a73aee8814) suggests that 
> > dma from the device uses these paths.  In which case you'll either want a 
> > separate function for that, or both source and destination must be aligned.
> > 
> 
> I think it's a valid case. Lets handle this by limiting the 'step' based on
> src/dest/n, something like below. atomic_read() instead of ldxxx_he_p() is
> used for loads. Please take a look if there are anything we can improve
> further.
> 
> // Not compiled and tested yet
> static void qemu_ram_copy_unaligned(void *dest, const void *src, size_t n)
> {
>     uintptr_t test, step;
> 
>     while (n != 0) {
>         test = (uintptr_t)src | n;       /* alignment enforced by source */
>         step = test & -test;
>         test = (uintptr_t)dest | n;      /* alignment enforced by destination 
> */
>         step = MIN(step, test & -test);
> 
>         switch (step) {
>         case 1:
>             qatomic_set((uint8_t *)dest, qatomic_read((uint8_t *)src));
>             src += 1;
>             dest += 1;
>             n -= 1;
>             break;
>         case 2:
>             qatomic_set((uint16_t *)dest, qatomic_read((uint16_t *)src));
>             src += 2;
>             dest += 2;
>             n -= 2;
>             break;
>         case 4:
>             qatomic_set((uint32_t *)dest, qatomic_read((uint32_t *)src));
>             src += 4;
>             dest += 4;
>             n -= 4;
>             break;
>         default:
>             qatomic_set((uint64_t *)dest, qatomic_read((uint64_t *)src);
>             src += 8;
>             dest += 8;
>             n -= 8;
>         }


If the length is > 8, I feel it's important to just use memcpy/memmove
normally.

Because at that point it's a large data transfer not MMIO, and performance
is important.


>     }
> }
> 
> 
> > > Another unrelated question: why 'int' value is returned from ldub_p()
> > > and ld{uw, l}_he_p() in bswap.h? They would return 'uint{8, 16, 32}_t'
> > > values?
> > 
> > A bad choice 20 years ago, and the need to audit all uses in order to 
> > change it now.
> > Newer functions use uintN_t as you suggest.
> > 
> 
> Ok :-)
> 
> > r~
> > 
> 
> Thanks,
> Gavin


Reply via email to