Changes since v7:
- Fixed typo `bits` -> `bytes`
- Tuned threshold for applying the optimization
- Provided results for larger sizes requested by Max Chou

This patch provides up to 60% speedup on the `memcpy` benchmark from:

https://github.com/embecosm/rise-rvv-tcg-qemu-tooling/tree/main/strmem-benchmarks

There is some variation in the measurements so results are attached for
six runs on a single thread on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

The three graphs are:
- memcpy-594c0cb1ab-128-speedup.pdf: VLEN 128
- memcpy-594c0cb1ab-1024-speedup.pdf: VLEN 1024
- memcpy-594c0cb1ab-stdlib-speedup.pdf: Scalar (to further illustrate
  measurement variation as this version will not touch the function
  modified by this patch)

Previous versions:
- v1: https://lore.kernel.org/all/[email protected]/
- v2: https://lore.kernel.org/all/[email protected]/
- v3: https://lore.kernel.org/all/[email protected]/
- v4: https://lore.kernel.org/all/[email protected]/
- v5: https://lore.kernel.org/all/[email protected]/
- v6: https://lore.kernel.org/all/[email protected]/
- v7: https://lore.kernel.org/all/[email protected]/

Cc: Richard Henderson <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Alistair Francis <[email protected]>
Cc: Bin Meng <[email protected]>
Cc: Weiwei Li <[email protected]>
Cc: Daniel Henrique Barboza <[email protected]>
Cc: Liu Zhiwei <[email protected]>
Cc: Helene Chelin <[email protected]>
Cc: Nathan Egge <[email protected]>
Cc: Max Chou <[email protected]>
Cc: Paolo Savini <[email protected]>

Craig Blackmore (2):
  target/riscv: rvv: fix typo in vext continuous ldst function names
  target/riscv: rvv: speed up small unit-stride loads and stores

 target/riscv/vector_helper.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

--
2.43.0
memcpy-594c0cb1ab-graphs.tar.gz
Description: application/gzip
