On 10/29/24 19:43, Paolo Savini wrote:
This patch optimizes the emulation of unit-stride load/store RVV instructions
when the data being loaded/stored per iteration amounts to 16 bytes or more.
The optimization consists of calling __builtin_memcpy on chunks of data of 16
bytes between the memory address of the simulated vector register and the
destination memory address and vice versa.
This is done only if we have direct access to the RAM of the host machine,
if the host is little endian and if it supports atomic 128-bit memory
operations.
Signed-off-by: Paolo Savini <[email protected]>
---
target/riscv/vector_helper.c | 17 ++++++++++++++++-
target/riscv/vector_internals.h | 12 ++++++++++++
2 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 75c24653f0..e1c100e907 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -488,7 +488,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd,
uint32_t byte_end,
}
fn = fns[is_load][group_size];
- fn(vd, byte_offset, host + byte_offset);
+
+ /* __builtin_memcpy uses host 16 bytes vector loads and stores if supported.
+ * We need to make sure that these instructions have guarantees of atomicity.
+ * E.g. x86 processors provide strong guarantees of atomicity for 16-byte
+ * memory operations if the memory operands are 16-byte aligned */
+ if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) &&
+ ((byte_offset % 16) == 0) && HOST_128_ATOMIC_MEM_OP) {
+ group_size = MO_128;
+ if (is_load) {
+ __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
+ } else {
+ __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
+ }
I said this last time and I'll say it again:
__builtin_memcpy DOES NOT equal VMOVDQA
Your comment there about 'if supported' does not really apply.
(1) You'd need a compile-time test, not the runtime test that is HOST_128_ATOMIC_MEM_OP, to
ensure that the compiler knows that AVX vector support is present.
(2) Even then, you're not giving the compiler any reason to use VMOVDQA over VMOVDQU or
ANY OTHER vector load/store. So you're not really doing what you say you're doing.
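
To make points (1) and (2) concrete, here is a rough sketch of what an explicit request could look like. The helper name copy16_aligned is hypothetical and is not part of QEMU; the sketch assumes that __AVX__ is the right compile-time gate and uses the SSE2 aligned load/store intrinsics. Even intrinsics leave the final instruction selection to the compiler, so this only expresses the aligned 16-byte access more directly than __builtin_memcpy does; it is not a guarantee that VMOVDQA is emitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/*
 * Hypothetical helper, not QEMU code: copy one 16-byte chunk.
 * Returns false without copying if either pointer is not 16-byte
 * aligned, so the caller can fall back to its per-element path.
 */
static inline bool copy16_aligned(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);

    if ((((uintptr_t)dst | (uintptr_t)src) & 15) != 0) {
        return false;
    }
#if defined(__AVX__)
    /*
     * Compile-time test: the compiler was told that AVX is available,
     * and the aligned load/store intrinsics ask for one 16-byte access
     * each instead of leaving the choice to memcpy expansion.
     */
    __m128i v = _mm_load_si128((const __m128i *)src);
    _mm_store_si128((__m128i *)dst, v);
#else
    /*
     * No AVX at compile time: plain copy, with no claim about which
     * instructions are used or whether the access is atomic.
     */
    memcpy(dst, src, 16);
#endif
    return true;
}
```

Built without -mavx (or an equivalent -march), only the memcpy branch is compiled, which is the situation the runtime-only check cannot detect.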
Frankly, I think this entire patch set is premature.
We need to get Max Chou's patch set landed first.
r~