Hi Zhao Zhili and Rémi Denis-Courmont,
Thank you very much, Zhao Zhili, for the helpful pointers! I'll review the 
latest RVV-related patches and pull requests on code.ffmpeg.org 
<https://code.ffmpeg.org/> right away and study your excellent summary of 
assembly optimization.
I also sincerely appreciate your detailed feedback, Rémi. In response to the 
points you raised, I'd like to share a bit about our current efforts:

 * RISE Multimedia Group: We'll reach out to our internal colleagues to check 
whether there are any ongoing initiatives within that community, to avoid 
duplication and explore potential collaboration.

 * Segmented load/store performance: We've encountered similar bottlenecks in 
our video decoding optimizations. To address this, we're actively proposing 
new vector instructions tailored to media workloads to the RISC-V 
International standards body. At the same time, we're working closely with 
RISC-V CPU microarchitecture teams to improve the hardware efficiency of 
these memory operations.

 * Scalable vector length (VLEN) challenges: I fully agree with your 
observation. RVV's scalability is meant to provide flexibility: ideally, a 
single optimized implementation should adapt gracefully across different VLEN 
configurations. In practice, however, video codecs like HEVC predominantly 
use fixed-size blocks (e.g., 4×4, 8×8). As a result, an algorithm optimized 
for VLEN=128 may not perform better, and may even regress, on a VLEN=256 
system, despite the latter's higher theoretical compute throughput. This 
forces us to develop separate optimizations per VLEN, which undermines the 
original intent of RVV's scalability. We believe there is significant room 
for discussion and innovation here, in both software strategies and hardware 
design.
We recognize that the RISC-V vector ecosystem is still evolving rapidly. 
Nevertheless, we’re confident that through close hardware-software co-design, 
RVV can become highly competitive in video coding workloads over time. RISC-V’s 
open nature makes it especially well-suited for such collaborative 
improvements—and we warmly welcome any performance insights, suggestions, or 
discussions from the community.
Thank you again for your valuable input and support!
Best regards,
Yunfei Zhou
Alibaba DAMO Academy
------------------------------------------------------------------
From: Rémi Denis-Courmont via ffmpeg-devel <[email protected]>
Sent: Friday, November 14, 2025, 22:22
To: FFmpeg development discussions and patches <[email protected]>
Cc: "[email protected]" <[email protected]>; "Rémi 
Denis-Courmont" <[email protected]>
Subject: [FFmpeg-devel] Re: [Question] Inquiry Regarding RISC-V RVV 
Optimization for HEVC Decoding in FFmpeg
Nihao,
On 14 November 2025 at 03:52:51 GMT+02:00, yunfei_zhou--- via ffmpeg-devel 
<[email protected]> wrote:
>Before proceeding, we would like to understand whether there are any existing 
>or ongoing efforts in this area to avoid duplication and, ideally, align or 
>collaborate with current initiatives.
Existing code you can find in the official Git repo. Ongoing efforts are 
unknown to us. You had probably better ask the RISE multimedia group than 
FFmpeg-devel. I suppose you or one of your colleagues should have access. (I 
don't think anyone here has.)
> * Available documentation or resources that could help us better understand 
>the existing codebase and optimization strategies.
To be honest, in my experience, while it is obviously possible to optimise 
video decoding with RVV, the current implementations are not competitive 
(with e.g. Armv8 AdvSIMD), due in particular to two aspects:
1) Segmented loads and stores are slow. Because video decoding often involves 
transposition, we really need segmented unit-strided accesses to run as fast, 
or almost as fast, as single-segment unit-strided accesses of the same size. 
Likewise, we need segmented register-strided accesses to be almost as fast as 
single-segment register-strided accesses.
2) Because RVV is scalable, and video decoding uses a lot of fixed-size and/or 
small vectors, we need instruction execution cost to scale according to VL or 
next_power_of_two(VL). Currently it seems to scale according to VLMAX, which 
means larger vectors make optimisations worse rather than better.
(This is based on benchmarks for your C910 and C908 cores, and SpacemiT's X60. 
I don't have access to any other hardware at the moment.)
Point being, the available hardware seems a little bit immature, so we don't 
really have settled optimisation strategies.
Br,
_______________________________________________
ffmpeg-devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]