Re: [x265] [PATCH] RISC-V: Add RVV optimized DCT32x32

daichengrong Fri, 06 Feb 2026 04:05:31 -0800

Thanks for the suggestion. Just to clarify, the implementation does not choose 
between 128-bit and 256-bit vector widths at compile time. The code follows a 
Vector-Length Agnostic (VLA) approach, so the actual vector width is determined 
by the hardware at runtime via RVV semantics rather than by function pointer 
selection.
The current repository implementation was originally written with a 128-bit 
assumption, which is why the initial validation was performed on 128-bit 
hardware to provide a direct comparison. With the VLA design, the same code 
runs correctly on wider vector hardware (e.g., 256-bit) without requiring 
separate code paths, and the test results confirm good scalability.
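
To make the VLA point concrete, here is a minimal sketch in plain C (it is not part of the patch) of the strip-mining pattern that the driver loop follows. The names `set_vl` and `passes_for_vlen` are hypothetical. `set_vl` is a simplified stand-in for `vsetvli` that grants min(requested, VLMAX); real hardware is allowed to pick a smaller value when the request lies between VLMAX and 2*VLMAX. The sketch assumes 16-bit elements at LMUL=1, matching the initial `vsetvli` in the patch, and reuses the first granted length for every pass, as the patch does.

```c
#include <stddef.h>

/* Hypothetical, simplified model of `vsetvli`: grant min(requested, vlmax). */
static size_t set_vl(size_t requested, size_t vlmax)
{
    return requested < vlmax ? requested : vlmax;
}

/*
 * Count how many passes a 32-row transform needs for a given VLEN.
 * Assumes e16/m1, so VLMAX = VLEN / 16. Returns -1 if VLEN is too small
 * to hold a single 16-bit element.
 */
static int passes_for_vlen(size_t vlen_bits)
{
    const size_t vlmax = vlen_bits / 16;
    if (vlmax == 0)
        return -1;

    size_t remaining = 32;                   /* rows still to transform */
    const size_t vl = set_vl(remaining, vlmax); /* queried once, then reused */
    int passes = 0;

    while (remaining > 0) {
        /* A real kernel would transform `vl` rows here. */
        remaining -= (vl < remaining) ? vl : remaining;
        passes++;
    }
    return passes;
}
```

Under these assumptions, `passes_for_vlen(128)` is 4 and `passes_for_vlen(256)` is 2. The loop body is identical in both cases; only the trip count changes with the hardware vector length.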
Please let me know if I have misunderstood your concern. I’m happy to clarify 
further.


On February 6, 2026 4:37:21 PM GMT+08:00, [email protected] wrote:
>It is recommended not to decide between using 128-bit or 256-bit width at 
>compile time. Instead, detect the bit width at runtime and then select 
>accordingly when initializing function pointers.
>
>吴昌盛0318004250
>
>Best Wishes!
>Changsheng Wu
>E: [email protected]
>SANECHIPS TECHNOLOGY CO.,LTD.
>
>Original
>
>From: daichengrong <[email protected]>
>To: [email protected] <[email protected]>;
>Date: February 6, 2026, 16:15
>Subject: [x265] [PATCH] RISC-V: Add RVV optimized DCT32x32
>
>
>This patch adds an RVV-optimized implementation of DCT 32x32 for RISC-V.
> 
>The current implementation in the repository is written with the assumption of 
>a 128-bit VLEN and does not account for wider vector lengths. Therefore, 
>initial testing was performed on a 128-bit platform, allowing the results to 
>directly reflect the advantages of the optimized code over the existing 
>implementation.
> 
>**SG2044 (128-bit VLEN):**
> 
>```
>dct32x32 | 5.14x | 1800.12 | 9247.73
>dct32x32 | 9.85x |  935.26 | 9214.26
>```
> 
>Building on this, the new implementation adopts a Vector-Length Agnostic (VLA) 
>design. Additional testing on a 256-bit platform demonstrates good scalability 
>and further performance gains.
> 
>**Banana Pi F3 (256-bit VLEN):**
> 
>```
>dct32x32 | 5.59x | 2222.48 | 12420.64
>dct32x32 | 13.28x |  935.97 | 12431.17
>```
> 
>To simplify comparison with the existing implementation, this patch introduces 
>an `RVV_DCT32_OPT` compile-time option. The optimization can be disabled using:
> 
>```
>-DRVV_DCT32_OPT=0
>```
> 
>allowing straightforward A/B performance testing.
> 
>Signed-off-by: daichengrong <[email protected]> 
>---
> source/CMakeLists.txt                    |   6 +
> source/common/CMakeLists.txt             |   2 +-
> source/common/riscv64/asm-primitives.cpp |   3 +
> source/common/riscv64/dct-32dct.S        | 714 +++++++++++++++++++++++
> source/common/riscv64/fun-decls.h        |   1 +
> 5 files changed, 725 insertions(+), 1 deletion(-)
> mode change 100755 => 100644 source/CMakeLists.txt
> create mode 100644 source/common/riscv64/dct-32dct.S
> 
>diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
>old mode 100755
>new mode 100644
>index 9f93b6ec2..fd91da702
>--- a/source/CMakeLists.txt
>+++ b/source/CMakeLists.txt
>@@ -512,6 +512,11 @@ int main() {
>             message(STATUS "Found RVV")
>             add_definitions(-DHAVE_RVV=1)
>  
>+            option(RVV_DCT32_OPT "Enable use of RVV DCT32 OPT" ON)
>+            if(RVV_DCT32_OPT)
>+                add_definitions(-DHAVE_RVV_OPT=1)
>+            endif()
>+
>             set(RVV_INTRINSIC_TEST [[
> #include <riscv_vector.h> 
> #include <stdint.h> 
>@@ -947,6 +952,7 @@ if((MSVC_IDE OR XCODE OR GCC) AND ENABLE_ASSEMBLY)
>         enable_language(ASM)
>         foreach(ASM ${RISCV64_ASMS})
>             set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/riscv64/${ASM})
>+            message(STATUS "add ... ${ASM_SRC}")
>             list(APPEND ASM_SRCS ${ASM_SRC})
>             list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
>             add_custom_command(
>diff --git a/source/common/CMakeLists.txt b/source/common/CMakeLists.txt
>index 69125c3cb..4945af009 100644
>--- a/source/common/CMakeLists.txt
>+++ b/source/common/CMakeLists.txt
>@@ -185,7 +185,7 @@ if(ENABLE_ASSEMBLY AND (RISCV64 OR CROSS_COMPILE_RISCV64))
>     source_group(Assembly FILES ${ASM_PRIMITIVES})
>  
>     # Add riscv64 assembly files here.
>-    set(A_SRCS asm.S blockcopy8.S dct.S sad-a.S ssd-a.S pixel-util.S mc-a.S p2s.S sao.S loopfilter.S intrapred.S riscv64_utils.S)
>+    set(A_SRCS asm.S blockcopy8.S dct.S sad-a.S ssd-a.S pixel-util.S mc-a.S p2s.S sao.S loopfilter.S intrapred.S riscv64_utils.S dct-32dct.S)
>     set(VEC_PRIMITIVES)
>  
>     if(CPU_HAS_RVV)
>diff --git a/source/common/riscv64/asm-primitives.cpp b/source/common/riscv64/asm-primitives.cpp
>index ce03288f9..7bd017cf8 100644
>--- a/source/common/riscv64/asm-primitives.cpp
>+++ b/source/common/riscv64/asm-primitives.cpp
>@@ -234,6 +234,9 @@ void setupRVVPrimitives(EncoderPrimitives &p)
>     p.dst4x4                = PFX(dst4_v);
>  
>     ALL_LUMA_TU_S(dct, dct, v);
>+#if defined(HAVE_RVV_OPT)
>+    p.cu[BLOCK_32x32].dct = PFX(dct_32_v_opt);
>+#endif
>     ALL_LUMA_TU_S(idct, idct, v);
>  
>     ALL_LUMA_TU_L(nonPsyRdoQuant, nonPsyRdoQuant, v);
>diff --git a/source/common/riscv64/dct-32dct.S b/source/common/riscv64/dct-32dct.S
>new file mode 100644
>index 000000000..a25521706
>--- /dev/null
>+++ b/source/common/riscv64/dct-32dct.S
>@@ -0,0 +1,714 @@
>+/*****************************************************************************
>+ * Copyright (C) 2026 MulticoreWare, Inc
>+ *
>+ * Authors: daichengrong <[email protected]> 
>+ *
>+ * This program is free software; you can redistribute it and/or modify
>+ * it under the terms of the GNU General Public License as published by
>+ * the Free Software Foundation; either version 2 of the License, or
>+ * (at your option) any later version.
>+ *
>+ * This program is distributed in the hope that it will be useful,
>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>+ * GNU General Public License for more details.
>+ *
>+ * You should have received a copy of the GNU General Public License
>+ * along with this program; if not, write to the Free Software
>+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
>+ *
>+ * This program is also available under a commercial proprietary license.
>+ * For more information, contact us at license @ x265.com.
>+ *****************************************************************************/
>+
>+#include "asm.S" 
>+
>+#ifdef __APPLE__
>+.section __RODATA,__rodata
>+#else
>+.section .rodata
>+#endif
>+
>+.align 4
>+
>+.set dct32_shift_1, 4 + BIT_DEPTH - 8
>+.set dct32_shift_2, 11
>+
>+.text
>+
>+#define DCT32_O_CONSTANT_1_0  90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4
>+#define DCT32_O_CONSTANT_3_1  90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13
>+#define DCT32_O_CONSTANT_5_2  88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22
>+#define DCT32_O_CONSTANT_7_3  85, 46, -13, -67, -90, -73, -22, 38, 82, 88, 54, -4, -61, -90, -78, -31
>+#define DCT32_O_CONSTANT_9_4  82, 22, -54, -90, -61, 13, 78, 85, 31, -46, -90, -67, 4, 73, 88, 38
>+#define DCT32_O_CONSTANT_11_5  78, -4, -82, -73, 13, 85, 67, -22, -88, -61, 31, 90, 54, -38, -90, -46
>+#define DCT32_O_CONSTANT_13_6  73, -31, -90, -22, 78, 67, -38, -90, -13, 82, 61, -46, -88, -4, 85, 54
>+#define DCT32_O_CONSTANT_15_7  67, -54, -78, 38, 85, -22, -90, 4, 90, 13, -88, -31, 82, 46, -73, -61
>+#define DCT32_O_CONSTANT_17_8  61, -73, -46, 82, 31, -88, -13, 90, -4, -90, 22, 85, -38, -78, 54, 67
>+#define DCT32_O_CONSTANT_19_9  54, -85, -4, 88, -46, -61, 82, 13, -90, 38, 67, -78, -22, 90, -31, -73
>+#define DCT32_O_CONSTANT_21_10  46, -90, 38, 54, -90, 31, 61, -88, 22, 67, -85, 13, 73, -82,  4, 78
>+#define DCT32_O_CONSTANT_23_11  38, -88, 73, -4, -67, 90, -46, -31, 85, -78, 13, 61, -90, 54, 22, -82
>+#define DCT32_O_CONSTANT_25_12  31, -78, 90, -61, 4, 54, -88, 82, -38, -22, 73, -90, 67, -13, -46, 85
>+#define DCT32_O_CONSTANT_27_13  22, -61, 85, -90, 73, -38, -4, 46, -78, 90, -82, 54, -13, -31, 67, -88
>+#define DCT32_O_CONSTANT_29_14  13, -38, 61, -78, 88, -90, 85, -73, 54, -31, 4, 22, -46, 67, -82, 90
>+#define DCT32_O_CONSTANT_31_15  4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90
>+
>+
>+#define DCT32_EO_CONSTANT_2_0  90, 87, 80, 70, 57, 43, 25, 9
>+#define DCT32_EO_CONSTANT_6_1  87, 57,  9, -43, -80, -90, -70, -25
>+#define DCT32_EO_CONSTANT_10_2  80,  9, -70, -87, -25, 57, 90, 43
>+#define DCT32_EO_CONSTANT_14_3  70, -43, -87,  9, 90, 25, -80, -57
>+
>+#define DCT32_EO_CONSTANT_18_4  57, -80, -25, 90, -9, -87, 43, 70
>+#define DCT32_EO_CONSTANT_22_5  43, -90, 57, 25, -87, 70,  9, -80
>+#define DCT32_EO_CONSTANT_26_6  25, -70, 90, -80, 43,  9, -57, 87
>+#define DCT32_EO_CONSTANT_30_7  9, -25, 43, -57, 70, -80, 87, -90
>+
>+.macro  lx rd, addr
>+#if (__riscv_xlen == 32)
>+        lw      \rd, \addr
>+#elif (__riscv_xlen == 64)
>+        ld      \rd, \addr
>+#else
>+        lq      \rd, \addr
>+#endif
>+.endm
>+
>+.macro  sx rd, addr
>+#if (__riscv_xlen == 32)
>+        sw      \rd, \addr
>+#elif (__riscv_xlen == 64)
>+        sd      \rd, \addr
>+#else
>+        sq      \rd, \addr
>+#endif
>+.endm
>+
>+.macro butterfly e, o, tmp_p, tmp_m
>+        vadd.vv         \tmp_p, \e, \o
>+        vsub.vv         \tmp_m, \e, \o
>+.endm
>+
>+.macro butterfly_widen e, o, tmp_p, tmp_m
>+        vwadd.vv         \tmp_p, \e, \o
>+        vwsub.vv         \tmp_m, \e, \o
>+.endm
>+
>+.macro DCT32_EEO_CAL dst, m1, m2, m3, m4, s1, s2, s3, s4, line, shift
>+    li              a2, \m1
>+    li              a3, \m2
>+    li              a4, \m3
>+    li              a5, \m4
>+    vmul.vx         \dst, \s1, a2
>+    vmacc.vx        \dst, a3, \s2
>+    vmacc.vx        \dst, a4, \s3
>+    vmacc.vx        \dst, a5, \s4
>+.endm
>+
>+.macro DCT32_4_DST_ADD_1_MEMBER first, in, dst_start_index, dst1, dst2, dst3, dst4, t0, t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14,t15
>+.if \dst_start_index == 0
>+    li              a2, \t0
>+    li              a3, \t1
>+    li              a4, \t2
>+    li              a5, \t3
>+.elseif \dst_start_index == 4
>+    li              a2, \t4
>+    li              a3, \t5
>+    li              a4, \t6
>+    li              a5, \t7
>+.elseif \dst_start_index == 8
>+    li              a2, \t8
>+    li              a3, \t9
>+    li              a4, \t10
>+    li              a5, \t11
>+.else
>+    li              a2, \t12
>+    li              a3, \t13
>+    li              a4, \t14
>+    li              a5, \t15
>+.endif
>+
>+.if \first == 1
>+    vmul.vx        \dst1, \in, a2
>+    vmul.vx        \dst2, \in, a3
>+    vmul.vx        \dst3, \in, a4
>+    vmul.vx        \dst4, \in, a5
>+.else
>+    vmacc.vx       \dst1, a2, \in
>+    vmacc.vx       \dst2, a3, \in
>+    vmacc.vx       \dst3, a4, \in
>+    vmacc.vx       \dst4, a5, \in
>+.endif
>+.endm
>+
>+.macro DCT32_STORE_L line, shift, in
>+    vnclip.wi           \in, \in, \shift
>+    addi                t0, a1, 32 * 2 * \line
>+    vse16.v             \in, (t0)
>+.endm
>+
>+.macro tr_32xN_rvv name, shift
>+function func_tr_32xN_\name\()_rvv
>+        .option arch, +zba
>+        // a7: base of the E vectors in the temporary stack area
>+        mv              a7, t5
>+        // t2: bytes per vector after widening to 32 bits (vl * 4)
>+        slli            t2, t4, 2
>+        // a6: base of the O vectors, 16 E vectors past the start of the area
>+        slli            t0, t2, 4
>+        add             a6, t5, t0
>+
>+        // load 0-3 28-31
>+        add             t0, a0, 2*0
>+        vlsseg4e16.v        v0,(a0), t3
>+        add             t0, a0, 2*28
>+        vlsseg4e16.v       v4,(t0), t3
>+
>+        butterfly_widen     v0, v7, v8, v16
>+        butterfly_widen     v1, v6, v10, v18
>+        butterfly_widen     v2, v5, v12, v20
>+        butterfly_widen     v3, v4, v14, v22
>+
>+        // load 4-7 24-27
>+        add             t0, a0, 2*4
>+        vlsseg4e16.v       v0,(t0), t3
>+        add             t0, a0, 2*24
>+        vlsseg4e16.v       v4,(t0), t3
>+
>+        // save E 0 1 2 3
>+        vse32.v         v8, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v10, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v12, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v14, (a7)
>+
>+        // save O 0 1 2 3
>+        vse32.v         v16, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v18, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v20, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v22, (a6)
>+
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        DCT32_4_DST_ADD_1_MEMBER     1, v16, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_1_0
>+        DCT32_4_DST_ADD_1_MEMBER     0, v18, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_3_1
>+
>+        DCT32_4_DST_ADD_1_MEMBER     0, v20, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_5_2
>+        DCT32_4_DST_ADD_1_MEMBER     0, v22, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_7_3
>+
>+        vsetvli zero, zero, e16, m1, ta, ma
>+        butterfly_widen     v0, v7, v8, v16
>+        butterfly_widen     v1, v6, v10, v18
>+        butterfly_widen     v2, v5, v12, v20
>+        butterfly_widen     v3, v4, v14, v22
>+
>+        // load 8-11 20-23
>+        add             t0, a0, 2*8
>+        vlsseg4e16.v       v0,(t0), t3
>+        add             t0, a0, 2*20
>+        vlsseg4e16.v       v4,(t0), t3
>+
>+        // save E 4 5 6 7
>+        add             a7, a7, t2
>+        vse32.v         v8, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v10, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v12, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v14, (a7)
>+
>+        // save O 4 5 6 7
>+        add             a6, a6, t2
>+        vse32.v         v16, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v18, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v20, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v22, (a6)
>+
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_9_4
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_11_5
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_13_6
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_15_7
>+
>+        vsetvli zero, zero, e16, m1, ta, ma
>+        butterfly_widen     v0, v7, v8, v16
>+        butterfly_widen     v1, v6, v10, v18
>+        butterfly_widen     v2, v5, v12, v20
>+        butterfly_widen     v3, v4, v14, v22
>+
>+        // load 12-15 16-19
>+        add             t0, a0, 2*12
>+        vlsseg4e16.v       v0,(t0), t3
>+        add             t0, a0, 2*16
>+        vlsseg4e16.v       v4,(t0), t3
>+
>+        // save E 8 9 10 11
>+        add             a7, a7, t2
>+        vse32.v         v8, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v10, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v12, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v14, (a7)
>+
>+        // save O 8 9 10 11
>+        add             a6, a6, t2
>+        vse32.v         v16, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v18, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v20, (a6)
>+        add             a6, a6, t2
>+        vse32.v         v22, (a6)
>+
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_17_8
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_19_9
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_21_10
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_23_11
>+
>+        vsetvli zero, zero, e16, m1, ta, ma
>+        butterfly_widen     v0, v7, v8, v16
>+        butterfly_widen     v1, v6, v10, v18
>+        butterfly_widen     v2, v5, v12, v20
>+        butterfly_widen     v3, v4, v14, v22
>+
>+        // save E 12 13 14 15
>+        add             a7, a7, t2
>+        vse32.v         v8, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v10, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v12, (a7)
>+        add             a7, a7, t2
>+        vse32.v         v14, (a7)
>+
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_25_12
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_27_13
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_29_14
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 0, v24, v26, v28, v30, DCT32_O_CONSTANT_31_15
>+
>+        vsetvli zero, zero, e16, m1, ta, ma
>+        DCT32_STORE_L   1, \shift, v24
>+        DCT32_STORE_L   3, \shift, v26
>+        DCT32_STORE_L   5, \shift, v28
>+        DCT32_STORE_L   7, \shift, v30
>+
>+
>+        // cal dst 4-15
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        // 12
>+        DCT32_4_DST_ADD_1_MEMBER      1, v16, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_25_12
>+        DCT32_4_DST_ADD_1_MEMBER      1, v16, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_25_12
>+        DCT32_4_DST_ADD_1_MEMBER      1, v16, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_25_12
>+        // reload O0 to v16
>+        slli                        t0, t2, 4
>+        add                         a6, t5, t0
>+        vle32.v                     v16, (a6)
>+
>+        // 13
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_27_13
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_27_13
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_27_13
>+        // reload O1 to v18
>+        add                         a6, a6, t2
>+        vle32.v                     v18, (a6)
>+
>+        // 14
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_29_14
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_29_14
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_29_14
>+        // reload O2 to v20
>+        add                         a6, a6, t2
>+        vle32.v                     v20, (a6)
>+
>+        // 15
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_31_15
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_31_15
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_31_15
>+        // reload O3 to v22
>+        add                         a6, a6, t2
>+        vle32.v                     v22, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_1_0
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_1_0
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_1_0
>+        // reload O4 to v16
>+        add                         a6, a6, t2
>+        vle32.v                     v16, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_3_1
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_3_1
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_3_1
>+        // reload O5 to v18
>+        add                         a6, a6, t2
>+        vle32.v                     v18, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_5_2
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_5_2
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_5_2
>+        // reload O6 to v20
>+        add                         a6, a6, t2
>+        vle32.v                     v20, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_7_3
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_7_3
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_7_3
>+        // reload O7 to v22
>+        add                         a6, a6, t2
>+        vle32.v                     v22, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_9_4
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_9_4
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_9_4
>+        // reload O8 to v16
>+        add                         a6, a6, t2
>+        vle32.v                     v16, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_11_5
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_11_5
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_11_5
>+        // reload O9 to v18
>+        add                         a6, a6, t2
>+        vle32.v                     v18, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_13_6
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_13_6
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_13_6
>+        // reload O10 to v20
>+        add                         a6, a6, t2
>+        vle32.v                     v20, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_15_7
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_15_7
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_15_7
>+        // reload O11 to v22
>+        add                         a6, a6, t2
>+        vle32.v                     v22, (a6)
>+
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_17_8
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_17_8
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_17_8
>+
>+        // reload   E 0 to v16
>+        add                             a7, t5, zero
>+        vle32.v                         v16, (a7)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_19_9
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_19_9
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_19_9
>+        // reload   E1 to v18
>+        add                             a7, a7, t2
>+        vle32.v                         v18, (a7)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_21_10
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_21_10
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_21_10
>+        // reload   E2 to v20
>+        add                             a7, a7, t2
>+        vle32.v                         v20, (a7)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 4, v0, v2, v4, v6, DCT32_O_CONSTANT_23_11
>+
>+        vsetvli zero, zero, e16, m1, ta, ma
>+        // write 9 11 13 15
>+        DCT32_STORE_L   9, \shift, v0
>+        DCT32_STORE_L   11, \shift, v2
>+        DCT32_STORE_L   13, \shift, v4
>+        DCT32_STORE_L   15, \shift, v6
>+
>+        // reload   E3 to v0
>+        add                             a7, a7, t2
>+        vle32.v                         v0, (a7)
>+        // reload   E12 to v2
>+        add                             a7, a7, t2
>+        sh3add                          a7, t2, a7
>+        vle32.v                         v2, (a7)
>+        // reload   E13 to v4
>+        add                             a7, a7, t2
>+        vle32.v                         v4, (a7)
>+        // reload   E14 to v6
>+        add                             a7, a7, t2
>+        vle32.v                         v6, (a7)
>+
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 8, v8, v10, v12, v14, DCT32_O_CONSTANT_23_11
>+        // write 17 19 21 23
>+        vsetvli zero, zero, e16, m1, ta, ma
>+        DCT32_STORE_L   17, \shift, v8
>+        DCT32_STORE_L   19, \shift, v10
>+        DCT32_STORE_L   21, \shift, v12
>+        DCT32_STORE_L   23, \shift, v14
>+
>+        // reload   E15 to v8
>+        add                             a7, a7, t2
>+        vle32.v                         v8, (a7)
>+
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 12, v24, v26, v28, v30, DCT32_O_CONSTANT_23_11
>+        vsetvli zero, zero, e16, m1, ta, ma
>+        // write 25 27 29 31
>+        DCT32_STORE_L   25, \shift, v24
>+        DCT32_STORE_L   27, \shift, v26
>+        DCT32_STORE_L   29, \shift, v28
>+        DCT32_STORE_L   31, \shift, v30
>+
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        // cal  E 3 12  EE EO  3
>+        butterfly v0, v2, v10, v0
>+        // save EE 3
>+        slli            t0, t2, 4
>+        add             a6, t5, t0
>+        vse32.v         v10, (a6)
>+        // reload E 4
>+        sh2add          a7, t2, t5
>+        vle32.v         v10, (a7)
>+
>+        // cal dst 2 6 10 14
>+        DCT32_4_DST_ADD_1_MEMBER      1, v0, 0, v24, v26, v28, v30, DCT32_EO_CONSTANT_14_3
>+
>+        // cal  E 2 13  EE EO  2
>+        butterfly v20, v4, v12, v20
>+        // save EE 2
>+        add             a6, a6, t2
>+        vse32.v         v12, (a6)
>+        // reload E 5
>+        add             a7, a7, t2
>+        vle32.v         v12, (a7)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 0, v24, v26, v28, v30, DCT32_EO_CONSTANT_10_2
>+
>+        // cal E 1 14  EE EO  1
>+        butterfly v18, v6, v14, v18
>+        // save EE 1
>+        add             a6, a6, t2
>+        vse32.v         v14, (a6)
>+        // reload E 6
>+        add             a7, a7, t2
>+        vle32.v         v14, (a7)
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 0, v24, v26, v28, v30, DCT32_EO_CONSTANT_6_1
>+
>+        // cal  E 0 15  EE EO  0
>+        butterfly v16, v8, v22, v16
>+        // save EE 0
>+        add             a6, a6, t2
>+        vse32.v         v22, (a6)
>+        // reload E 7
>+        add             a7, a7, t2
>+        vle32.v         v22, (a7)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 0, v24, v26, v28, v30, DCT32_EO_CONSTANT_2_0
>+
>+        // cal dst 18 22 26 30
>+        DCT32_4_DST_ADD_1_MEMBER      1, v0, 4, v2, v4, v6, v8, DCT32_EO_CONSTANT_14_3
>+        // reload E 8   v0
>+        add             a7, a7, t2
>+        vle32.v         v0, (a7)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v20, 4, v2, v4, v6, v8, DCT32_EO_CONSTANT_10_2
>+        // reload E 9     v20
>+        add             a7, a7, t2
>+        vle32.v         v20, (a7)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v18, 4, v2, v4, v6, v8, DCT32_EO_CONSTANT_6_1
>+        // reload E 10     v18
>+        add             a7, a7, t2
>+        vle32.v         v18, (a7)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v16, 4, v2, v4, v6, v8, DCT32_EO_CONSTANT_2_0
>+
>+
>+        // cal  E 7 8  EE EO  7
>+        butterfly v22, v0, v16, v22
>+        // reload E 11     v0
>+        add             a7, a7, t2
>+        vle32.v         v0, (a7)
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 0, v24, v26, v28, v30, DCT32_EO_CONSTANT_30_7
>+        DCT32_4_DST_ADD_1_MEMBER      0, v22, 4, v2, v4, v6, v8, DCT32_EO_CONSTANT_30_7
>+
>+        // cal  E 6 9  EE EO  6
>+        butterfly v14, v20, v22, v14
>+        // reload EE 0  v20
>+        vle32.v         v20, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v14, 0, v24, v26, v28, v30, DCT32_EO_CONSTANT_26_6
>+        DCT32_4_DST_ADD_1_MEMBER      0, v14, 4, v2, v4, v6, v8, DCT32_EO_CONSTANT_26_6
>+
>+        // cal E 5 10  EE EO  5
>+        butterfly v12, v18, v14, v12
>+
>+        // reload EE 1  v18
>+        sub             a6, a6, t2
>+        vle32.v         v18, (a6)
>+
>+        DCT32_4_DST_ADD_1_MEMBER      0, v12, 0, v24, v26, v28, v30, DCT32_EO_CONSTANT_22_5
>+        DCT32_4_DST_ADD_1_MEMBER      0, v12, 4, v2, v4, v6, v8, DCT32_EO_CONSTANT_22_5
>+
>+        // cal  E 4 11  EE EO  4
>+        butterfly v10, v0, v12, v10
>+        // reload EE 2  v0
>+        sub             a6, a6, t2
>+        vle32.v         v0, (a6)
>+        DCT32_4_DST_ADD_1_MEMBER      0, v10, 0, v24, v26, v28, v30, DCT32_EO_CONSTANT_18_4
>+        DCT32_4_DST_ADD_1_MEMBER      0, v10, 4, v2, v4, v6, v8, DCT32_EO_CONSTANT_18_4
>+        // reload EE 3  v10
>+        sub             a6, a6, t2
>+        vle32.v         v10, (a6)
>+
>+        //write dst 2 6 10 14 18 22 26 30
>+        vsetvli zero, zero, e16, m1, ta, ma
>+        DCT32_STORE_L   2, \shift, v24
>+        DCT32_STORE_L   6, \shift, v26
>+        DCT32_STORE_L   10, \shift, v28
>+        DCT32_STORE_L   14, \shift, v30
>+
>+        DCT32_STORE_L   18, \shift, v2
>+        DCT32_STORE_L   22, \shift, v4
>+        DCT32_STORE_L   26, \shift, v6
>+        DCT32_STORE_L   30, \shift, v8
>+
>+        vsetvli zero, zero, e32, m2, ta, ma
>+        //  EE 0-7 ready in register
>+
>+        // EE 3 4 EEE EEO 3
>+        butterfly       v10, v12, v28, v26
>+        // EE 1 6 EEE EEO 1
>+        butterfly       v18, v22, v24, v22
>+        // EE 2 5 EEE EEO 2
>+        butterfly       v0, v14, v30, v10
>+        // EE 0 7       EEE EEO 0
>+        butterfly v20, v16, v14, v12
>+
>+
>+        // EEO[0-3] v12 v22 v10 v26
>+        //dst 4 12 20 28
>+        DCT32_EEO_CAL   v4, 89, 75, 50, 18, v12, v22, v10, v26, 4, \shift
>+        DCT32_EEO_CAL   v8, 75, -18, -89, -50, v12, v22, v10, v26, 12, \shift
>+        DCT32_EEO_CAL   v6, 50, -89, 18, 75, v12, v22, v10, v26, 20, \shift
>+        DCT32_EEO_CAL   v16, 18, -50, 75, -89, v12, v22, v10, v26, 28, \shift
>+
>+        vsetvli         zero, zero, e16, m1, ta, ma
>+
>+        DCT32_STORE_L   4, \shift, v4
>+        DCT32_STORE_L   12, \shift, v8
>+        DCT32_STORE_L   20, \shift, v6
>+        DCT32_STORE_L   28, \shift, v16
>+
>+        vsetvli         zero, zero, e32, m2, ta, ma
>+        # EEEE[0] = EEE[0] + EEE[3];
>+        # EEEO[0] = EEE[0] - EEE[3];
>+        butterfly       v14, v28, v16, v20
>+        # EEEE[1] = EEE[1] + EEE[2];
>+        # EEEO[1] = EEE[1] - EEE[2];
>+        butterfly       v24, v30, v2, v4
>+
>+
>+        # dst[0] = (int16_t)((g_t32[0][0] * EEEE[0] + g_t32[0][1] * EEEE[1] + add) >> shift);
>+        // 64 64
>+        li              a2, 64
>+        li              a3, 64
>+        vmul.vx         v18, v16, a2
>+        vmacc.vx        v18, a3, v2
>+        # dst[8 * line] = (int16_t)((g_t32[8][0] * EEEO[0] + g_t32[8][1] * EEEO[1] + add) >> shift);
>+        // 83  36
>+        li              a2, 83
>+        li              a3, 36
>+        vmul.vx         v6, v20, a2
>+        vmacc.vx        v6, a3, v4
>+        # dst[16 * line] = (int16_t)((g_t32[16][0] * EEEE[0] + g_t32[16][1] * EEEE[1] + add) >> shift);
>+        // 64  -64
>+        li              a2, 64
>+        li              a3, -64
>+        vmul.vx         v8, v16, a2
>+        vmacc.vx        v8, a3, v2
>+        # dst[24 * line] = (int16_t)((g_t32[24][0] * EEEO[0] + g_t32[24][1] * EEEO[1] + add) >> shift);
>+        // 36 -83
>+        li              a2, 36
>+        li              a3, -83
>+        vmul.vx         v10, v20, a2
>+        vmacc.vx        v10, a3, v4
>+
>+        //write dst 0 8 16 24
>+        vsetvli         zero, zero, e16, m1, ta, ma
>+        DCT32_STORE_L   0, \shift, v18
>+        DCT32_STORE_L   8, \shift, v6
>+        DCT32_STORE_L   16, \shift, v8
>+        DCT32_STORE_L   24, \shift, v10
>+
>+        ret
>+endfunc
>+.endm
>+
>+tr_32xN_rvv firstpass, dct32_shift_1
>+tr_32xN_rvv secondpass, dct32_shift_2
>+
>+.macro DCT_N size
>+function PFX(dct_\size\()_v_opt)
>+        .option arch, +zba
>+
>+        addi    sp, sp, -16
>+        sx      ra, (sp)
>+
>+        mv      t6, a1
>+        csrwi   vxrm, 0
>+
>+        li     t1, 32
>+        vsetvli t4, t1, e16, m1, ta, ma
>+
>+        li      t0, 4096
>+        // temp stack address
>+        sub     t5, sp, t0
>+        li      t0, 2048
>+        sub     sp, t5, t0
>+
>+        // a0
>+        mv      a1, sp
>+        slli    t3, a2, 1
>+1:
>+        jal     func_tr_32xN_firstpass_rvv
>+        mul     t0, t4, t3
>+        add     a0, a0, t0
>+        slli    t0, t4, 1
>+        add     a1, a1, t0
>+        sub     t1, t1, t4
>+        bnez    t1, 1b
>+
>+        li      t1, 32
>+        mv      a0, sp
>+        mv      a1, t6
>+        li      t3, 64
>+1:
>+        jal     func_tr_32xN_secondpass_rvv
>+        slli    t0, t4, 6
>+        add     a0, a0, t0
>+        slli    t0, t4, 1
>+        add     a1, a1, t0
>+        sub     t1, t1, t4
>+        bnez    t1, 1b
>+
>+2:
>+        li      t0, 4096+2048
>+        add     sp, sp, t0
>+        lx      ra, (sp)
>+        addi    sp, sp, 16
>+
>+        ret
>+endfunc
>+.endm
>+
>+DCT_N 32
>diff --git a/source/common/riscv64/fun-decls.h b/source/common/riscv64/fun-decls.h
>index ec04d9968..7ffb32e65 100644
>--- a/source/common/riscv64/fun-decls.h
>+++ b/source/common/riscv64/fun-decls.h
>@@ -123,6 +123,7 @@ FUNCDEF_TU_S(void, cpy1Dto2D_shr, v, int16_t* dst, const int16_t* src, intptr_t
> FUNCDEF_TU_S(void, ssimDist, v, const pixel *fenc, uint32_t fStride, const pixel *recon, intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k);
> FUNCDEF_TU_S(void, idct, v, const int16_t* src, int16_t* dst, intptr_t dstStride);
> FUNCDEF_TU_S(void, dct, v, const int16_t* src, int16_t* dst, intptr_t srcStride);
>+FUNCDEF_TU_S(void, dct, v_opt, const int16_t* src, int16_t* dst, intptr_t srcStride);
> FUNCDEF_TU_S(void, getResidual, v, const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
>  
> FUNCDEF_TU_S2(void, intra_pred_planar, rvv, pixel* dst, intptr_t dstride, const pixel* srcPix, int, int);
>--  
>2.34.1
> 
>_______________________________________________
>x265-devel mailing list
>[email protected]
>https://mailman.videolan.org/listinfo/x265-devel
