On 7/17/23 22:47, Joern Rennecke wrote:
Subject: cpymem for RISCV with v extension
From: Joern Rennecke <joern.renne...@embecosm.com>
Date: 7/17/23, 22:47
To: GCC Patches <gcc-patches@gcc.gnu.org>
As discussed on last week's patch call, this patch uses either a
straight copy or an opaque pattern that emits the loop as assembly to
optimize cpymem for the 'v' extension.
I used Ju-Zhe Zhong's patch - starting in git with:
Author: zhongjuzhe <66454988+zhongju...@users.noreply.github.com>
Date: Mon Mar 21 14:20:42 2022 +0800
PR for RVV support using splitted small chunks (#334)
as a starting point, even though not all that much of the original code remains.
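
For readers following along, my understanding is that the looping variant
amounts to the usual strip-mined copy: vsetvli decides how many bytes this
iteration handles, vle8.v/vse8.v move them, and the pointers and the
remaining count are adjusted by that amount.  A plain C++ paraphrase of
what the emitted loop computes (a sketch only; the names and the
bytes_per_op parameter are mine, not taken from the patch):

#include <algorithm>
#include <cstddef>
#include <cstring>

/* Strip-mined block copy: each iteration moves up to one "vector's worth"
   of bytes.  At run time vl comes from vsetvli; here it is a parameter.  */
static void
strip_mined_copy (char *dst, const char *src, size_t n, size_t bytes_per_op)
{
  while (n)
    {
      size_t vl = std::min (n, bytes_per_op);  /* vsetvli: vl = min (n, VLMAX) */
      memcpy (dst, src, vl);                   /* vle8.v + vse8.v              */
      dst += vl;                               /* bump the pointers by vl      */
      src += vl;
      n -= vl;                                 /* decrement the remaining count */
    }
}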
Regression tested on x86_64-pc-linux-gnu X
riscv-sim
riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
riscv-sim/-march=rv32imfdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
cpymem-diff-20230718.txt
2023-07-12  Ju-Zhe Zhong  <juzhe.zh...@rivai.ai>
            Joern Rennecke  <joern.renne...@embecosm.com>
* config/riscv/riscv-protos.h (riscv_vector::expand_block_move):
Declare.
* config/riscv/riscv-v.cc (riscv_vector::expand_block_move):
New function.
* config/riscv/riscv.md (cpymemsi): Use riscv_vector::expand_block_move.
* config/riscv/vector.md (@cpymem_straight<P:mode><V_WHOLE:mode>):
New define_insn patterns.
(@cpymem_loop<P:mode><V_WHOLE:mode>): Likewise.
(@cpymem_loop_fast<P:mode><V_WHOLE:mode>): Likewise.
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index b4884a30872..e61110fa3ad 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -49,6 +49,7 @@
#include "tm-constrs.h"
#include "rtx-vector-builder.h"
#include "targhooks.h"
+#include "predict.h"
Not sure this is needed, but I didn't scan for it explicitly. If it's
not needed, then remove it.
+ if (CONST_INT_P (length_in))
+ {
+ HOST_WIDE_INT length = INTVAL (length_in);
+
+ /* By using LMUL=8, we can copy as many bytes in one go as there
+ are bits in a vector register. If the entire block thus fits,
+ we don't need a loop. */
+ if (length <= TARGET_MIN_VLEN)
+ {
+ need_loop = false;
+
+ /* If a single scalar load / store pair can do the job, leave it
+ to the scalar code to do that. */
+
+ if (pow2p_hwi (length) && length <= potential_ew)
+ return false;
+ }
We could probably argue over the threshold for doing the copy on the
scalar side, but I don't think it's necessary. Once we start seeing V
hardware we can revisit.
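For concreteness: with a VLEN of 128 bits and rv64 with aligned operands
(potential_ew of 8), only power-of-two sizes up to 8 bytes are left to the
scalar code, anything up to 128 bytes becomes a single vector copy, and
larger blocks take the loop.  A standalone sketch of that decision
(illustrative only, not the patch code; the numbers in main are
hypothetical):

#include <cstdio>
#include <initializer_list>

/* Which path a copy of `length` bytes takes; min_vlen is in bits (with
   LMUL=8 one vector operation covers min_vlen bytes), potential_ew is the
   widest scalar element in bytes.  */
static const char *
classify_copy (long length, int min_vlen, int potential_ew)
{
  if (length <= min_vlen)
    {
      if ((length & (length - 1)) == 0 && length <= potential_ew)
        return "left to the scalar code (single load/store pair)";
      return "one vector copy, no loop";
    }
  return "looping vector copy";
}

int
main ()
{
  /* Assuming TARGET_MIN_VLEN == 128 and rv64 with aligned operands.  */
  for (long len : {8L, 96L, 200L})
    printf ("%ld bytes: %s\n", len, classify_copy (len, 128, 8));
}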
+
+ /* Find the vector mode to use. Using the largest possible element
+ size is likely to give smaller constants, and thus potentially
+ reducing code size. However, if we need a loop, we need to update
+ the pointers, and that is more complicated with a larger element
+ size, unless we use an immediate, which prevents us from dynamically
+ using the largets transfer size that the hart supports. And then,
+ unless we know the*exact* vector size of the hart, we'd need
+ multiple vsetvli / branch statements, so it's not even a size win.
+ If, in the future, we find an RISCV-V implementation that is slower
+ for small element widths, we might allow larger element widths for
+ loops too. */
s/largets/largest/
And a space is missing in "the*exact*".
Note that I think the proposed glibc copier does allow larger element
widths for this case.
+
+ /* Unless we get an implementation that's slow for small element
+ size / non-word-aligned accesses, we assume that the hardware
+ handles this well, and we don't want to complicate the code
+ with shifting word contents around or handling extra bytes at
+ the start and/or end. So we want the total transfer size and
+ alignemnt to fit with the element size. */
s/alignemnt/alignment/
Yes, let's not try to support every uarch we can envision and instead do
a good job on the uarches we know about.  If a uarch with slow small-element
or non-word-aligned accesses comes along, its developers can propose changes
at that time.
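To make the "fit with the element size" constraint concrete: one plausible
way to pick the width is to take the largest power of two that divides both
the total length and the known alignment, capped at the widest scalar
element.  A sketch of that idea (illustrative only; not claiming this is
exactly what the patch does):

/* Pick an element width in bytes that divides both the copy length and the
   known alignment, capped at potential_ew (e.g. 8 on rv64).  */
static int
pick_element_width (unsigned long length, unsigned int align, int potential_ew)
{
  int ew = potential_ew;
  while (ew > 1 && (length % ew != 0 || align % ew != 0))
    ew >>= 1;
  return ew;   /* e.g. length 24, align 4 -> 4-byte elements */
}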
+
+ // The VNx*?I modes have a factor of riscv_vector_chunks for nunits.
Comment might need updating after the recent work to adjust the modes.
I don't recall if we kept the VNx*?I modes or not.
The needed adjustments are all comment related, so this is OK after fixing
the comments.  Just post the updated patch for archival purposes and consider
it pre-approved for the trunk.
Thanks, and sorry for the wait, Joern.
jeff