riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.

Paolo Savini Wed, 30 Oct 2024 08:26:19 -0700

Thanks for the review Richard.

On 10/30/24 11:40, Richard Henderson wrote:

On 10/29/24 19:43, Paolo Savini wrote:
This patch optimizes the emulation of unit-stride load/store RVVinstructionswhen the data being loaded/stored per iteration amounts to 16 bytesor more.The optimization consists of calling __builtin_memcpy on chunks ofdata of 16bytes between the memory address of the simulated vector register andthe
destination memory address and vice versa.
This is done only if we have direct access to the RAM of the hostmachine,
if the host is little endiand and if it supports atomic 128 bit memory
operations.
Signed-off-by: Paolo Savini <[email protected]>
---
  target/riscv/vector_helper.c    | 17 ++++++++++++++++-
  target/riscv/vector_internals.h | 12 ++++++++++++
  2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 75c24653f0..e1c100e907 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -488,7 +488,22 @@ vext_group_ldst_host(CPURISCVState *env, void*vd, uint32_t byte_end,
      }
        fn = fns[is_load][group_size];
-    fn(vd, byte_offset, host + byte_offset);
+
+ /* __builtin_memcpy uses host 16 bytes vector loads and storesif supported.+ * We need to make sure that these instructions have guaranteesof atomicity.+ * E.g. x86 processors provide strong guarantees of atomicityfor 16-byte
+     * memory operations if the memory operands are 16-byte aligned */
+    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) &&
+            ((byte_offset % 16) == 0) && HOST_128_ATOMIC_MEM_OP) {
+      group_size = MO_128;
+      if (is_load) {
+ __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t*)(host + byte_offset), 16);
+      } else {
+ __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t*)(vd + byte_offset), 16);
+      }
I said this last time and I'll say it again:

    __builtin_memcpy DOES NOT equal VMOVDQA

I am aware of this. I took __builtin_memcpy as a generic enough way toemulate loads and stores that should allow several hosts to generate thewidest load/store instructions they can and on x86 I see this generatesinstructions vmovdpu/movdqu that are not always guaranteed to be atomic.x86 though guarantees them to be atomic if the memory address is alignedto 16 bytes. Hence the check on the alignment. My x86 knowledge isadmittedly limited though so the check on alignment might be redundant?


Your comment there about 'if supported' does not really apply.

This refers to the fact that in order to decide whether to use__builtin_memcpy which could generate 16 byte mem ops we use cpuinfo totest for applicable host features (so for x86 whether the host supportsAVX or atomic vmovdqa/vmovdqu). The intent here is to only use__builtin_memcpy when we know that 16 byte mem ops, if generated, willhave a guarantee of atomicity.

Before I submit a new version of the patch, does this comment describethe situation more clearly?


    /* __builtin_memcpy could generate host 16 bytes memory operations that

* would accelerate the emulation of RVV loads and stores of morethan 16 bytes. * We need to make sure that these instructions have guarantees ofatomicity. * E.g. x86 processors provide strong guarantees of atomicity for16-byte

     * memory operations if the memory operands are 16-byte aligned */

Worst case scenario the memcpy will generate normal 8 byte mem ops.

(1) You'd need a compile-time test not the runtime test that isHOST_128_ATOMIC_MEM_OP to ensure that the compiler knows that AVXvector support is present.

(lack of QEMU knowledge here) I assumed it was safe enough to use QEMU'scpuinfo in the way it is initialized for x86 as an example inutil/cpuinfo-i386.c:cpuinfo_init() and by building this for Aarch64 Isee that these conditions work as intended (by not using memcpy, becausewe haven't added atomicity checks for Aarch64 yet).

(2) Even then, you're not giving the compiler any reason to useVMOVDQA over VMOVDQU or ANY OTHER vector load/store. So you're notreally doing what you say you're doing.


I hope that the new version of the comment addresses this.

If I misinterpreted the point of your feedback and you are saying thatit would be better to have guarantee to have specific host atomic vectorloads/stores rather then using memcpy, there's the option of usingcompiler builtins to generate some specific host instructions and soenhance performance for some hosts but that didn't seem like a neatsolution. Any thoughts?



Frankly, I think this entire patch set is premature.
We need to get Max Chou's patch set landed first.

Understood. I'm trying anyway to get to the bottom of this as it canprove useful for other optimizations.


Thanks again for the review work.

Paolo

Re: [RFC v4 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.

Reply via email to