Changes since v6:
- Limit access size to element size to address Max Chou's review.
- Fix a typo in the name of a function that this patch now calls.

With access size limited to element size, this patch still provides a
significant speedup.  The `memcpy` benchmark from:

  https://github.com/embecosm/rise-rvv-tcg-qemu-tooling/tree/main/strmem-benchmarks

shows up to 75% speedup with this patch:
    
  VLEN | Size | ns/inst (ratio)
  -----|------|-----------------
   128 |    1 |            1.50
   128 |    2 |            1.42
   128 |    3 |            1.35
   128 |    4 |            1.29
   128 |    5 |            1.23
   128 |    7 |            1.18
   128 |    8 |            1.09
   128 |    9 |            1.06
   128 |   11 |            1.01
    
  VLEN | Size | ns/inst (ratio)
  -----|------|-----------------
  1024 |    1 |            1.75
  1024 |    2 |            1.62
  1024 |    3 |            1.52
  1024 |    4 |            1.43
  1024 |    5 |            1.35
  1024 |    7 |            1.31
  1024 |    8 |            1.12
  1024 |    9 |            1.12
  1024 |   11 |            1.01
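
As a reading aid for the ratio column, here is a minimal sketch of how a
ratio maps to a percentage speedup.  It assumes the ratio is baseline
ns/inst divided by patched ns/inst, which the tables do not state
explicitly; `speedup_percent` is a hypothetical helper, not part of the
patch or the benchmark tooling.

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a ratio into a percentage speedup.
 * Assumption: ratio = baseline ns/inst / patched ns/inst, so a ratio of
 * 1.0 means no change and values below 1.0 mean a slowdown.
 */
static double speedup_percent(double ratio)
{
    assert(ratio > 0.0);
    return (ratio - 1.0) * 100.0;
}
```

Under that assumption, the 1.75 entry for VLEN 1024 and size 1 is the
source of the "up to 75%" figure.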

It is not clear to me exactly why the patch is now helping.  At first I
thought it was due to avoiding `vext_continuous_ldst_host` calling out
to `memcpy` for small sizes, but trying that directly in
`vext_continuous_ldst_host` was much less beneficial:

  VLEN |  Size | ns/inst (ratio)
  -----|-------|-----------------
   128 |     1 |            1.06
   128 |     2 |            1.14
   128 |     3 |            1.03
   128 |     4 |            1.04
   128 |     5 |            1.02
   128 |     7 |            1.02
   128 |     8 |            0.91
   128 |     9 |            0.92
   128 |    11 |            1.03
  
  VLEN |  Size | ns/inst (ratio)
  -----|-------|-----------------
  1024 |     1 |            1.10
  1024 |     2 |            1.14
  1024 |     3 |            1.04
  1024 |     4 |            1.05
  1024 |     5 |            0.96
  1024 |     7 |            1.07
  1024 |     8 |            0.94
  1024 |     9 |            0.93
  1024 |    11 |            0.90
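
To make the two approaches concrete, the following is a rough,
standalone sketch and not the code in `vector_helper.c`.  Both function
names are hypothetical.  `copy_by_memcpy` stands in for a copy that
calls out to `memcpy` with a run-time size, and `copy_by_element` stands
in for a copy whose individual accesses are limited to the element
size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in: one memcpy call with a run-time size. */
static void copy_by_memcpy(void *dst, const void *src, size_t size)
{
    memcpy(dst, src, size);
}

/*
 * Hypothetical stand-in: copy `size` bytes using accesses no wider than
 * the element size `esz` (1, 2, 4 or 8 bytes).  The fixed-size memcpy
 * calls avoid unaligned-access undefined behaviour.
 */
static void copy_by_element(void *dst, const void *src, size_t size,
                            size_t esz)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    assert(esz == 1 || esz == 2 || esz == 4 || esz == 8);
    assert(size % esz == 0);

    for (size_t i = 0; i < size; i += esz) {
        switch (esz) {
        case 1: {
            d[i] = s[i];
            break;
        }
        case 2: {
            uint16_t v;
            memcpy(&v, s + i, sizeof(v));
            memcpy(d + i, &v, sizeof(v));
            break;
        }
        case 4: {
            uint32_t v;
            memcpy(&v, s + i, sizeof(v));
            memcpy(d + i, &v, sizeof(v));
            break;
        }
        default: {
            uint64_t v;
            memcpy(&v, s + i, sizeof(v));
            memcpy(d + i, &v, sizeof(v));
            break;
        }
        }
    }
}
```

Optimizing compilers typically turn the constant-size `memcpy` calls
into single loads and stores, so the loop avoids a library call per
copy.  Both functions produce the same bytes, so the sketch shows only
the shape of the two approaches and does not explain the measured
difference.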

Previous versions:
- v1: https://lore.kernel.org/all/[email protected]/
- v2: https://lore.kernel.org/all/[email protected]/
- v3: https://lore.kernel.org/all/[email protected]/
- v4: https://lore.kernel.org/all/[email protected]/
- v5: https://lore.kernel.org/all/[email protected]/
- v6: https://lore.kernel.org/all/[email protected]/

Cc: Richard Henderson <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Alistair Francis <[email protected]>
Cc: Bin Meng <[email protected]>
Cc: Weiwei Li <[email protected]>
Cc: Daniel Henrique Barboza <[email protected]>
Cc: Liu Zhiwei <[email protected]>
Cc: Helene Chelin <[email protected]>
Cc: Nathan Egge <[email protected]>
Cc: Max Chou <[email protected]>
Cc: Paolo Savini <[email protected]>

Craig Blackmore (2):
  target/riscv: rvv: fix typo in vext continuous ldst function names
  target/riscv: rvv: speed up small unit-stride loads and stores

 target/riscv/vector_helper.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

-- 
2.43.0

