On Tue, Jun 16, 2026 at 02:22:18PM +1000, Gavin Shan wrote:
> On 6/16/26 7:31 AM, Gavin Shan wrote:
> > On 6/16/26 5:42 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 16, 2026 at 05:24:15AM +1000, Gavin Shan wrote:
> > > > Hi Peter and Michael,
> > > >
> > > > On 6/16/26 1:12 AM, Peter Maydell wrote:
> > > > > On Mon, 15 Jun 2026 at 15:56, Michael S. Tsirkin <[email protected]>
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Jun 15, 2026 at 11:57:09AM +0100, Peter Maydell wrote:
> > > > > > > On Mon, 15 Jun 2026 at 11:03, Gavin Shan <[email protected]> wrote:
> > > > > > > >
> > > > > > > > All ram device regions were turned to be indirectly accessible
> > > > > > > > by commit
> > > > > > > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions").
> > > > > > > > This leads
> > > > > > > > to guest hang on attempt to build 'cuda-samples' as reported by
> > > > > > > > Julia. The
> > > > > > > > guest is started by the following command lines, with GH100 GPU
> > > > > > > > card passed
> > > > > > > > from the host.
> > > > > > >
> > > > > > > > diff --git a/include/system/memory.h b/include/system/memory.h
> > > > > > > > index 1417132f6d..5878727d09 100644
> > > > > > > > --- a/include/system/memory.h
> > > > > > > > +++ b/include/system/memory.h
> > > > > > > > @@ -2897,6 +2897,8 @@ void
> > > > > > > > address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
> > > > > > > > void address_space_unregister_map_client(AddressSpace *as,
> > > > > > > > QEMUBH *bh);
> > > > > > > >
> > > > > > > > /* Internal functions, part of the implementation of
> > > > > > > > address_space_read. */
> > > > > > > > +void qemu_ram_copy(void *dest, const void *src, size_t n);
> > > > > > > > +void qemu_ram_move(void *dest, const void *src, size_t n);
> > > > > > >
> > > > > > > New function prototypes in include headers need documentation
> > > > > > > comments.
> > > > > > >
> > > > > > > In particular for these, it's really important that we clearly say
> > > > > > > what semantics we are attempting to provide with them, so that
> > > > > > > (a) when we're reviewing or later updating the implementation we
> > > > > > > know what we are trying to provide
> > > > > > > (b) when we're looking at the callsites we know what the function
> > > > > > > is guaranteeing to us and what it is not, and thus whether it's
> > > > > > > OK to use it or we need something els.
> > > > > > >
> > > > > > > > +#if defined(__i386__) || defined(__x86_64__)
> > > > > > > > +#define HOST_UNALIGNED_MMIO_OK 1
> > > > > > > > +#else
> > > > > > > > +#define HOST_UNALIGNED_MMIO_OK 0
> > > > > > > > +#endif
> > > > > > >
> > > > > > > We need to do something better than this. We can't
> > > > > > > just say "oh, we trust that on x86 this works": it is
> > > > > > > neither actually true that the compiler guarantees it
> > > > > >
> > > > > > sorry guarantee what?
> > > > >
> > > > > Well, that's part of my point about "we need a doc comment":
> > > > > what exactly are we trying to guarantee ? But whatever it is,
> > > > > "hardcoded ifdef that privileges x86" is definitely the wrong
> > > > > answer.
> > > > >
> > > >
> > > > Could you please check the following comments (documentation context)
> > > > for the
> > > > added functions look good to you? Please let me know if there are
> > > > anything
> > > > can be improved.
> > > >
> > > > +
> > > > +/**
> > > > + * qemu_ram_copy: copy data to a RAMBlock
> > > > + *
> > > > + * @dest: destination where the data is copied to. It's the pointer
> > > > returned
> > > > + * by ramblock_ptr() and its variants.
> > > > + * @src: source where the data is copied from. It can be a regular
> > > > memory block
> > > > + * or a pointer returned by ramblock_ptr() and its variants.
> > > > + * @n: length of data to be copied
> > > > + *
> > > > + * The destination is always a pointer to data area of a RAMBlock
> > > > which can
> > > > + * be for a regular memory block or a MMIO region. The source can be a
> > > > pointer
> > > > + * to a regular memory block or a MMIO region. Atomicity isn't
> > > > guranteed by
> > > > + * memcpy() to copy data between those MMIO regions as we do in this
> > > > function.
> > > > + * Besides, unaligned accesses are well handled on all architectures
> > > > except
> > > > + * i386 and x86_64 where the unaligned accesses are supported by the
> > > > CPUs.
> > > > + *
> > > > + * The ensured atomicity and alignment consideration, which aren't
> > > > needed
> > > > + * by data copy between two regular memory blocks. So performance
> > > > penalty
> > > > + * is introduced by this function in such circumstance.
> > > > + */
> > > > +void qemu_ram_copy(void *dest, const void *src, size_t n);
> > > > +
> > > > +/**
> > > > + * qemu_ram_move: move data to a RAMBlock
> > > > + *
> > > > + * @dest: destination where the data is moved to. It's the pointer
> > > > returned
> > > > + * by ramblock_ptr() and its variants.
> > > > + * @src: source where the data is moved from. It can be a regular
> > > > memory block
> > > > + * or a pointer returned by ramblock_ptr() and its variants.
> > > > + * @n: length of data to be moved
> > > > + *
> > > > + * The destination is always a pointer to data area of a RAMBlock
> > > > which can
> > > > + * be for a regular memory block or a MMIO region. The source can be a
> > > > pointer
> > > > + * to a regular memory block or a MMIO region. Atomicity isn't
> > > > guranteed by
> > > > + * memmove() to move data from or to those MMIO regions as we do in
> > > > this
> > > > + * function. Besides, unaligned accesses are well handled on all
> > > > architectures
> > > > + * except i386 and x86_64 where the unaligned accesses are supported
> > > > by the
> > > > + * CPUs.
> > > > + *
> > > > + * The ensured atomicity and alignment consideration, which aren't
> > > > needed
> > > > + * by data movement between two regular memory blocks. So performance
> > > > penalty
> > > > + * is introduced by this function in such circumstance.
> > > > + */
> > > > +void qemu_ram_move(void *dest, const void *src, size_t n);
> > > > +
> > >
> > >
> > > I apologize, I don't understand what these comments are saying, at all.
> > >
> >
> > Sorry to hear that, Michael. There are two things done in the newly added
> > function qemu_ram_{copy, move}(), comparing to the original memcpy() and
> > memmove():
> >
> > - Data copy or movement on MMIO region(s) using __bultin_{memcopy,
> > memmove}()
> > on x86, or qatomic_set() on other architecutures, to avoid the unexpected
> > optimization done by the compiler on memcpy/memmove. Unsafe instructions
> > can be produced by the compiler's optimization. This fixes the issue
> > resolved
> > by commit 4a2e242bbb ("memory: Don't use memcpy for ram_device regions")
> >
> > - The unaligned access is handled by forcing the access size aligned to
> > (src | dest | n)
> > on non-x86 architectures, comparing to the original memcpy() and
> > memmove().
> >
> > Please let me know if we're on same page. If we do, I can revise the
> > comments
> > accordingly.
> >
>
> I've modified the comments to something as below. Please let me know if it's
> better. The point is to emphasize qemu_ram_{copy, memmove}() are friendly to
> IO regions :-)
>
> -----> include/system/memory.h
>
> +/**
> + * qemu_ram_copy: copy data to a guest memory block
> + *
what is a memory block? ramblock?
don't we need something like this for reads?
> + * @dest: destination where the data is copied to. A pointer to a guest
> memory
@dst?
> + * block returned from ramblock_ptr() or its variants
> + * @src: source where the data is copied from. A pointer to host memory block
> + * or guest memory block returned from ramblock_ptr() or its variants.
> + * @n: length of data to be copied
> + *
> + * When the source pointer, destinatoin pointer or both dereference guest
> + * memory blocks, which can be IO blocks (regions).
IO block?
let's not invent terminology pls.
> Under this cirtumstance,
> + * data copy between those blocks is equivalent to IO accesses. It's unsafe
> + * to use memcpy(), which could be translated to IO unfriendly instructions
> + * by the compiler due to code level optimization, resulting in the
> unexpected
> + * behavior. This function is ensured to be friendly to both IO and memory
> + * accesses. No optimized instructions should be generated by the compiler,
> + * and no unexpected behavior should be observed for IO accesses.
Let's be specific, and brief, and tell user what this does:
Maybe:
Copy @n bytes from @src to @dst.
Assumes @src and @dst do not overlap.
Handles special cases such as uncacheable ramblocks correctly.
Use this for accessing ramblock in response to DMA/VCPU IO
and in preference to memcpy.
> + */
> +void qemu_ram_copy(void *dest, const void *src, size_t n);
> +
> +/**
> + * qemu_ram_move: move data to a guest memory block
> + *
> + * @dest: destination where the data is moved to. A pointer to a guest memory
> + * block returned from ramblock_ptr() or its variants
> + * @src: source where the data is moved from. A pointer to host memory block
> + * or guest memory block returned from ramblock_ptr() or its variants.
> + * @n: length of data to be copied
> + *
> + * Similar to qemu_ram_copy(), this function is ensured to be friendly to
> both
> + * IO and memory accesses. No optimized instructions should be generated by
> + * the compiler, and no unexpected behavior should be observed for IO
> accesses.
unclear how it is different from copy.
> + */
> +void qemu_ram_move(void *dest, const void *src, size_t n);
>
> > >
> > >
> > > > > -- PMM
> > > > >
>
> Thanks,
> Gavin