On 7/17/23 22:47, Joern Rennecke wrote:
Subject: cpymem for RISCV with v extension
From: Joern Rennecke <joern.renne...@embecosm.com>
Date: 7/17/23, 22:47
To: GCC Patches <gcc-patches@gcc.gnu.org>


As discussed on last week's patch call, this patch optimizes cpymem for the
'v' extension by emitting either a straight copy or an opaque pattern that
emits the copy loop as assembly.
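As an illustration, the loop-free "straight copy" shape corresponds roughly to
the C-intrinsics sketch below (the function name and the use of the
__riscv_-prefixed RVV intrinsics are illustrative only; the patch itself emits
this at the RTL level through the new @cpymem_straight pattern):

  #include <stddef.h>
  #include <stdint.h>
  #include <riscv_vector.h>

  /* Illustrative only: with e8/LMUL=8 a single vsetvli can cover as many
     bytes as there are bits in a vector register, so a block that fits
     needs just one load and one store, no loop.  */
  static void
  straight_copy (uint8_t *dst, const uint8_t *src, size_t n)
  {
    size_t vl = __riscv_vsetvl_e8m8 (n);            /* assumes n <= VLEN bytes  */
    vuint8m8_t v = __riscv_vle8_v_u8m8 (src, vl);   /* whole-block load   */
    __riscv_vse8_v_u8m8 (dst, v, vl);               /* whole-block store  */
  }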
I used Ju-Zhe Zhong's patch - starting in git with:

Author: zhongjuzhe<66454988+zhongju...@users.noreply.github.com>
Date:   Mon Mar 21 14:20:42 2022 +0800

       PR for RVV support using splitted small chunks (#334)

as a starting point, even though not all that much of the original code remains.

Regression tested on x86_64-pc-linux-gnu X
     riscv-sim
     riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
     riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
     riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
     riscv-sim/-march=rv32imfdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
     riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
     riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
     riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d


cpymem-diff-20230718.txt

2023-07-12  Ju-Zhe Zhong<juzhe.zh...@rivai.ai>
             Joern Rennecke<joern.renne...@embecosm.com>

        * config/riscv/riscv-protos.h (riscv_vector::expand_block_move):
        Declare.
        * config/riscv/riscv-v.cc (riscv_vector::expand_block_move):
        New function.
        * config/riscv/riscv.md (cpymemsi): Use riscv_vector::expand_block_move.
        * config/riscv/vector.md (@cpymem_straight<P:mode><V_WHOLE:mode>):
        New define_insn patterns.
        (@cpymem_loop<P:mode><V_WHOLE:mode>): Likewise.
        (@cpymem_loop_fast<P:mode><V_WHOLE:mode>): Likewise.


diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index b4884a30872..e61110fa3ad 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -49,6 +49,7 @@
  #include "tm-constrs.h"
  #include "rtx-vector-builder.h"
  #include "targhooks.h"
+#include "predict.h"
I'm not sure this include is needed, but I didn't check for it explicitly. If it's not needed, please remove it.



+  if (CONST_INT_P (length_in))
+    {
+      HOST_WIDE_INT length = INTVAL (length_in);
+
+    /* By using LMUL=8, we can copy as many bytes in one go as there
+       are bits in a vector register.  If the entire block thus fits,
+       we don't need a loop.  */
+    if (length <= TARGET_MIN_VLEN)
+      {
+       need_loop = false;
+
+       /* If a single scalar load / store pair can do the job, leave it
+          to the scalar code to do that.  */
+
+       if (pow2p_hwi (length) && length <= potential_ew)
+         return false;
+      }
We could probably argue over the threshold for doing the copy on the scalar side, but I don't think it's necessary. Once we start seeing V hardware we can revisit.
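To make the threshold concrete, a hypothetical standalone mirror of the quoted check (leave_to_scalar is an illustrative name, not from the patch; potential_ew is the candidate element width, e.g. 8 on rv64):

  #include <stdbool.h>

  /* Hypothetical mirror of the quoted condition: the copy is left to the
     scalar code when one naturally sized scalar load/store pair can do
     the whole job.  */
  static bool
  leave_to_scalar (long length, long potential_ew)
  {
    /* Open-coded equivalent of GCC's pow2p_hwi: non-zero power of two.  */
    bool pow2 = length > 0 && (length & (length - 1)) == 0;
    return pow2 && length <= potential_ew;
  }

  /* With potential_ew == 8: lengths 1, 2, 4 and 8 stay scalar, while
     e.g. 12 or 16 bytes take the vector straight-copy path.  */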


+
+      /* Find the vector mode to use.  Using the largest possible element
+        size is likely to give smaller constants, and thus potentially
+        reducing code size.  However, if we need a loop, we need to update
+        the pointers, and that is more complicated with a larger element
+        size, unless we use an immediate, which prevents us from dynamically
+        using the largets transfer size that the hart supports.  And then,
+        unless we know the*exact*  vector size of the hart, we'd need
+        multiple vsetvli / branch statements, so it's not even a size win.
+        If, in the future, we find an RISCV-V implementation that is slower
+        for small element widths, we might allow larger element widths for
+        loops too.  */
s/largets/largest/

And a space is missing in "the*exact*".

Note that I think the proposed glibc copier does allow larger element widths for this case.
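The pointer-update point is easiest to see in a C-intrinsics sketch of the loop with byte elements (again illustrative only; the actual loop is emitted as assembly by the @cpymem_loop / @cpymem_loop_fast patterns):

  #include <stddef.h>
  #include <stdint.h>
  #include <riscv_vector.h>

  /* Illustrative only: with e8 the element count returned by vsetvli is
     also the byte count, so the pointer and length updates need no
     scaling; with e64 every update would need a shift by 3 (or an extra
     register holding the scaled count).  */
  static void
  copy_loop (uint8_t *dst, const uint8_t *src, size_t n)
  {
    while (n > 0)
      {
        size_t vl = __riscv_vsetvl_e8m8 (n);
        vuint8m8_t v = __riscv_vle8_v_u8m8 (src, vl);
        __riscv_vse8_v_u8m8 (dst, v, vl);
        src += vl;
        dst += vl;
        n -= vl;
      }
  }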

+
+         /* Unless we get an implementation that's slow for small element
+            size / non-word-aligned accesses, we assume that the hardware
+            handles this well, and we don't want to complicate the code
+            with shifting word contents around or handling extra bytes at
+            the start and/or end.  So we want the total transfer size and
+            alignemnt to fit with the element size.  */
s/alignemnt/alignment/

Yes, let's not try to support every uarch we can envision; instead let's do a good job on the uarches we know about. If a uarch with slow small-element or non-word-aligned accesses comes along, its maintainers can propose changes at that time.



+
+         // The VNx*?I modes have a factor of riscv_vector_chunks for nunits.
This comment might need updating after the recent work to adjust the modes; I don't recall whether we kept the VNx*?I modes or not.

So the requested adjustments are all comment-related, and this is OK after fixing the comments. Just post the updated patch for archival purposes and consider it pre-approved for the trunk.

Thanks, and sorry for the wait, Joern.

jeff
