Re: [x265] [PATCH] RISCV64: add copy_cnt assembly optimization

chen Tue, 08 Jul 2025 21:13:03 -0700

Hi Changsheng,




Thanks for the reminder; I had overlooked 'm2'.

I still feel that RISC-V V is at an early stage: parts of the ISA are unclear, 
there are no cycle/throughput tables, and so on. RISC-V is also fragmented, so 
asm code optimized for one kind of CPU may be slower on another vendor's CPU.

We can keep GCC/LLVM auto-vectorization as the starting point and add asm code 
later, once the documentation is good enough.




Regards,
Chen




At 2025-07-09 10:02:27, [email protected] wrote:

Hi Chen,




Thank you for your reply. Let me try to address your questions.





1. GCC uses registers (V8, V10, V12, V14) because the m2 in vsetvli 
a5,zero,e32,m2,ta,ma indicates LMUL=2, which combines Vi and Vi+1 into a 
register group, doubling the vector width to 2*VLEN. Each of the four segment 
fields therefore occupies one two-register group, so the instruction actually 
uses 8 registers: V8–V15.
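The register-group arithmetic described in point 1 can be sketched in C. This is only an illustrative model, assuming an integer LMUL; the helper names field_base_reg, last_reg and print_fields are hypothetical and not part of x265, GCC, or any RISC-V toolchain.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: first register of segment field `field` when the
 * destination starts at v<vd> and each field spans `lmul` registers. */
int field_base_reg(int vd, int lmul, int field)
{
    return vd + field * lmul;
}

/* Hypothetical helper: last register touched by an nf-field segment access. */
int last_reg(int vd, int lmul, int nf)
{
    /* Segment accesses require EMUL * NFIELDS <= 8. */
    assert(lmul >= 1 && nf >= 1 && lmul * nf <= 8);
    return vd + nf * lmul - 1;
}

/* Print the register group used by each field. */
void print_fields(int vd, int lmul, int nf)
{
    for (int f = 0; f < nf; f++) {
        int base = field_base_reg(vd, lmul, f);
        printf("field %d: v%d..v%d\n", f, base, base + lmul - 1);
    }
}
```

Calling print_fields(8, 2, 4) models vlseg4e32.v v8 under m2 and lists the groups starting at v8, v10, v12 and v14; with lmul set to 1 the same call gives the v8, v9, v10, v11 layout.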

2. Our code is based on the officially ratified and released specification 
documents, and we have tested and verified it on hardware compliant with the 
RVA23 profile. Therefore, no code changes will be needed because of 
instruction changes. For example, the Matrix extension planned as a future 
RISC-V addition will not alter the existing Vector 1.0 extension.








Best Wishes！

Changsheng Wu

E：[email protected]

SANECHIPS TECHNOLOGY CO.,LTD.




Original
From: chen <[email protected]>
To: 吴昌盛0318004250;
Cc: [email protected] 
<[email protected]>;[email protected] 
<[email protected]>;[email protected] 
<[email protected]>;沈显来0318003851;袁佳0318004243;
Date: 2025-07-08 22:35
Subject: Re:Re: [x265] [PATCH] RISCV64: add copy_cnt assembly optimization

Hi Changsheng,




Thank you for providing so much detailed information. I am glad to see that 
RISC-V V is gradually maturing.




I ran a simple experiment with GCC; it can now generate vectorized code 
automatically. However, I found an unusual instruction, vlseg4e32.v, in the 
output, and after consulting the RISC-V V documentation I have many doubts.

For example, are the registers (V8, V9, V10, V11), as in the document, or 
(V8, V10, V12, V14), as in the GCC output?

Also, the details of many ASM instructions still exist only in fragmented 
slide decks from different users and companies, without a unified official 
document.







These issues may lead to repeated code rework in the future, so I still do 
not recommend accepting RISC-V V now.

If possible, please help the RISC-V community improve the RISC-V V ISA 
documentation. I will be pleased to accept RISC-V V as one of the target 
platforms in the future.




Regards,
Chen




Code

int test1(int *x, int N)
{
    int sum = 0;
    if (__builtin_expect(N % 4, 0))
    {
        for(int i = 0; i < N; i+=4)
        {
            sum += (x[i+0] + x[i+1] + x[i+2] + x[i+3]);
        }
    }
    else
        __builtin_unreachable();
    return sum;
}




GCC output

test1(int*, int):
        ble     a1,zero,.L4
        vsetvli a5,zero,e32,m2,ta,ma
        addiw   a4,a1,-1
        vmv.v.i v4,0
        srliw   a4,a4,2
        addiw   a4,a4,1
.L3:
        vsetvli a5,a4,e32,m2,tu,ma
        vlseg4e32.v     v8,(a0)
        slli    a3,a5,4
        sub     a4,a4,a5
        add     a0,a0,a3
        vadd.vv v2,v10,v8
        vadd.vv v2,v2,v12
        vadd.vv v2,v2,v14
        vadd.vv v4,v4,v2
        bne     a4,zero,.L3
        vsetvli a5,zero,e32,m2,ta,ma
        vmv.s.x v1,zero
        vredsum.vs      v4,v4,v1
        vmv.x.s a0,v4
        ret
.L4:
        li      a0,0
        ret




RISC-V Document
7.8.2. Vector Strided Segment Loads and Stores

Vector strided segment loads and stores move contiguous segments where each 
segment is separated by the byte-stride offset given in the rs2 GPR argument.

Note: Negative and zero strides are supported.

    # Format
    vlsseg<nf>e<eew>.v vd, (rs1), rs2, vm          # Strided segment loads
    vssseg<nf>e<eew>.v vs3, (rs1), rs2, vm         # Strided segment stores

    # Examples
    vsetvli a1, t0, e8, ta, ma
    vlsseg3e8.v v4, (x5), x6   # Load bytes at addresses x5+i*x6   into v4[i],
                               #  and bytes at addresses x5+i*x6+1 into v5[i],
                               #  and bytes at addresses x5+i*x6+2 into v6[i].

    # Examples
    vsetvli a1, t0, e32, ta, ma
    vssseg2e32.v v2, (x5), x6   # Store words from v2[i] to address x5+i*x6
                                #   and words from v3[i] to address x5+i*x6+4
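The vlsseg3e8.v example quoted above can be written as a scalar C loop. This is only a sketch of the address pattern in the quoted comments: the function name model_vlsseg_e8 and the MAX_VL bound are assumptions made for illustration, and masking and tail handling are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VL 16 /* assumed upper bound on vl, for this model only */

/* Scalar model of vlsseg<nf>e8.v: element i reads the segment that starts
 * at base + i*stride; its nf consecutive bytes go to fields 0..nf-1,
 * i.e. to vd[i], vd+1[i], ... at LMUL=1. */
void model_vlsseg_e8(uint8_t fields[][MAX_VL], int nf,
                     const uint8_t *base, ptrdiff_t stride, size_t vl)
{
    assert(nf >= 1 && nf <= 8 && vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++)
        for (int f = 0; f < nf; f++)
            fields[f][i] = base[(ptrdiff_t)i * stride + f];
}
```

With nf = 3 and a stride of 4 bytes, field 0 receives bytes 0, 4, 8, field 1 receives bytes 1, 5, 9, and field 2 receives bytes 2, 6, 10. The unit-stride form (vlseg<nf>e<eew>.v, as in the GCC output) corresponds to a stride equal to nf times the element size.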



At 2025-07-07 15:24:16, [email protected] wrote:

Hi Chen,




Thank you for your previous feedback.




I'd like to supplement some information about RISC-V Vector V1.0 and hope you 
can reconsider x265 support for the RISC-V architecture.   




1. The RISC-V community considers Vector V1.0 a stable version. RISC-V 
Vector V1.0 was officially approved and released in 2021. The server profile 
RVA23, released in October 2024, also specifies Vector V1.0, and the RISC-V 
Instruction Set Manual Volume published the same year adopts Vector V1.0 as 
well.

2. Many chip manufacturers already support RISC-V Vector Extension V1.0, such 
as the already released SiFive P670/P470, Andes NX27V, Alibaba C920, and 
SpaceMIT X100 CPUs. In the next year or two, many more vendors will launch 
chips supporting Vector V1.0.

3. GCC experimentally introduced RISC-V Vector support in GCC 12 (May 2022) and 
officially supported RISC-V Vector V1.0 in GCC 14 (May 2024).

4. The Linux kernel merged support for RISC-V Vector V1.0 in June 2023 and 
released it in the LTS 6.6 version.

5. Our company has already planned to deploy RISC-V servers in data centers, 
with x265 video encoding being one of the key business scenarios. We will 
continue contributing RISC-V architecture patches.




RISC-V has garnered widespread attention and strong investment, leading to 
rapid development. I believe it will become another mainstream architecture 
following x86 and Arm. RISC-V is now commercially viable and deserves adoption 
by the x265 community.







Best Wishes！

Changsheng Wu

M: +86 13776570034

E：[email protected]

SANECHIPS TECHNOLOGY CO.,LTD.









From: chen <[email protected]>
To: 吴昌盛0318004250;
Cc: [email protected] 
<[email protected]>;[email protected] 
<[email protected]>;[email protected] 
<[email protected]>;沈显来0318003851;袁佳0318004243;吴昌盛0318004250;
Date: 2025-07-07 03:28
Subject: Re:[x265] [PATCH] RISCV64: add copy_cnt assembly optimization

Hi Changsheng,




Thanks for the patches.




However, I don't think the RISC-V V extension is stable enough yet.

v1.0 frozen in September 2021

v1.1 public review in May 2023

no further updates as of July 2025



Also, most instructions have no behavior description.




For example, vredsum.vs in the patch

vredsum.vs  vd, vs2, vs1, vm   # vd[0] =  sum( vs1[0] , vs2[*] )




I can only guess that it means
vd[0] =  vs1[0] + sum(vs2[*])
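That reading of vredsum.vs can be expressed as a scalar C model. This is a sketch under stated assumptions: SEW=32, no mask, and vl > 0; the function name model_vredsum_e32 is hypothetical and is not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of an unmasked vredsum.vs at SEW=32:
 * result = vs1[0] + vs2[0] + ... + vs2[vl-1], wrapping on overflow.
 * Only element 0 of vs1 takes part; the result is written to vd[0]. */
int32_t model_vredsum_e32(int32_t vs1_0, const int32_t *vs2, size_t vl)
{
    /* With vl = 0 the instruction leaves vd unchanged, which a function
     * returning a value cannot represent, so the model requires vl > 0. */
    assert(vl > 0);
    uint32_t acc = (uint32_t)vs1_0; /* unsigned keeps wraparound well defined */
    for (size_t i = 0; i < vl; i++)
        acc += (uint32_t)vs2[i];
    return (int32_t)acc;
}
```

Passing 0 for vs1_0 gives a plain horizontal sum of vs2, and a nonzero vs1_0 acts as the initial accumulator value.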




Another example is vlse8.v.

My guess is that it is equivalent to x86 PSHUFB or ARM VTBL.




The examples above are only my guesses. I have not been able to confirm my 
understanding over the past couple of years, and there are too many similar 
problems in the RISC-V V extension.

So I suggest not integrating or implementing RISC-V patches until the 
specification becomes stable enough.




Regards,

Chen

At 2025-07-06 10:08:25, [email protected] wrote:

From 7562e3a834a6a5ea76ab1b97acf915e095646cd5 Mon Sep 17 00:00:00 2001


From: Changsheng Wu <[email protected]>

Date: Sat, 5 Jul 2025 23:09:14 +0800

Subject: [PATCH] RISCV64: add copy_cnt assembly optimization




TestBench test result:

      primitive |   speedup    |   optimized asm   |   C reference
  copy_cnt[4x4] |        1.34x |          123.12   |      165.06
  copy_cnt[8x8] |        2.64x |          214.07   |      564.26
copy_cnt[16x16] |        3.96x |          563.83   |      2232.00
copy_cnt[32x32] |        7.44x |          2144.80  |      15954.42

_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel
