Re: [FFmpeg-devel] [PATCH 0/5] RISC-V: Improve H264 decoding performance using RVV intrinsic

2023-05-09 Thread Lynne
May 9, 2023, 11:51 by arnie.ch...@sifive.com:

> We are submitting a set of patches that significantly improve H.264 decoding
> performance by utilizing RVV intrinsic code. The average speedup (FPS)
> achieved by these patches is more than 2x, as measured on 720p videos
> running on an internal FPGA board.
>
> Patch1: add support for RVV intrinsic code in the configure file
> Patch2: optimize chroma motion compensation
> Patch3: optimize luma motion compensation
> Patch4: optimize DSP functions, such as IDCT, in-loop filtering, and
> weighted filtering
> Patch5: optimize intra prediction
>
> Arnie Chang (5):
>  configure: Add detection of RISC-V vector intrinsic support
>  lavc/h264chroma: Add vectorized implementation of chroma MC for RISC-V
>  lavc/h264qpel: Add vectorized implementation of luma MC for RISC-V
>  lavc/h264dsp: Add vectorized implementation of DSP functions for RISC-V
>  lavc/h264pred: Add vectorized implementation of intra prediction for RISC-V
>

Could you rewrite this in asm instead? I'd like RISC-V to follow the same
policy as we have for Arm: no intrinsics. There's a long list of reasons we
don't use intrinsics, which I won't get into.
Just a few days ago, I discovered that our PPC intrinsics were performing
quite badly due to compiler issues, in some cases 500x slower than C.
Also, we don't care about overall speedup. We have checkasm --bench
to measure the per-function speedup over C.
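
To illustrate what "per-function speedup over C" means, here is a minimal
sketch of the idea: check that a candidate implementation matches the C
reference, time both, and report the ratio. This is not checkasm's actual
code. Every name below (add_bias_c, add_bias_unrolled, outputs_match,
time_per_call, speedup) is hypothetical, and the "optimised" function is
only an unrolled C stand-in for what would be an assembly routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Signature shared by the reference and the candidate implementation. */
typedef void (*add_bias_fn)(uint8_t *dst, const uint8_t *src, size_t n, int bias);

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Plain C reference: add a bias to each sample, saturating to 0..255. */
static void add_bias_c(uint8_t *dst, const uint8_t *src, size_t n, int bias)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = clip_u8(src[i] + bias);
}

/* Hypothetical stand-in for an optimised version (manually unrolled C). */
static void add_bias_unrolled(uint8_t *dst, const uint8_t *src, size_t n, int bias)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = clip_u8(src[i]     + bias);
        dst[i + 1] = clip_u8(src[i + 1] + bias);
        dst[i + 2] = clip_u8(src[i + 2] + bias);
        dst[i + 3] = clip_u8(src[i + 3] + bias);
    }
    for (; i < n; i++)              /* leftover samples */
        dst[i] = clip_u8(src[i] + bias);
}

/* Returns 1 if both implementations produce identical output. */
static int outputs_match(add_bias_fn ref, add_bias_fn opt, int bias)
{
    uint8_t src[67], a[67], b[67];  /* odd length exercises the tail loop */
    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (uint8_t)(i * 37);
    ref(a, src, sizeof(src), bias);
    opt(b, src, sizeof(src), bias);
    return memcmp(a, b, sizeof(a)) == 0;
}

/* Average CPU time per call, in seconds, over `runs` calls. */
static double time_per_call(add_bias_fn fn, int runs)
{
    static uint8_t src[4096], dst[4096];
    clock_t start = clock();
    for (int r = 0; r < runs; r++)
        fn(dst, src, sizeof(src), 3);
    return (double)(clock() - start) / CLOCKS_PER_SEC / runs;
}

/* Per-function speedup of the candidate relative to the reference. */
static double speedup(double t_ref, double t_opt)
{
    return t_opt > 0.0 ? t_ref / t_opt : 0.0;
}
```

A caller would first require outputs_match() to succeed, then compare
time_per_call() for both functions. The point is that the figure is
reported per function rather than as an overall FPS number.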
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 0/5] RISC-V: Improve H264 decoding performance using RVV intrinsic

2023-05-09 Thread Rémi Denis-Courmont
Hi,

On Tuesday, 9 May 2023, 12:50:25 EEST, Arnie Chang wrote:
> We are submitting a set of patches that significantly improve H.264 decoding
> performance by utilizing RVV intrinsic code.

I believe that there is a general dislike of compiler intrinsics for vector 
optimisations in FFmpeg for a plurality of reasons. FWIW, that dislike is not 
limited to FFmpeg:
https://www.reddit.com/r/RISCV/comments/131hlgq/comment/ji1ie3l/
Indeed, in my personal opinion, RISC-V V intrinsics specifically are painful to 
read/write compared to assembler.

On top of that, in this particular case, intrinsics have at least three, 
possibly four, additional and more objective challenges as compared to the 
existing RVV assembler:

1) They are less portable, requiring the most bleeding edge version of 
compilers. Case in point: our FATE GCC instance does not support them as of 
today (because Debian Unstable does not).

2) They do not work with run-time CPU detection, at least not currently. This 
is going to be a major stumbling point for Linux distributions which need to 
build code that runs on processors without vector unit.

3) V intrinsics require specifying the group multiplier at every instruction. 
In most cases, this is just very inconvenient. But in those algorithms that 
require a fixed vector size (e.g. Opus DSP already now), this simply does _not_ 
work.

Essentially, this is the downside of relying on the compiler to do the 
register allocation.

4) (Unsure) Intrinsics are notorious for missing some code points.


The first two points may be addressed eventually. But the third point is 
intrinsic to intrinsics (hohoho). So unless there is a case for why intrinsics 
would be all but _required_, please avoid them.
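
For context on point 2, the following is a minimal sketch of run-time
dispatch, assuming a table of function pointers that defaults to C and is
overridden only when the CPU reports vector support. All names
(SKETCH_CPU_FLAG_RVV, SketchDSPContext, sketch_dsp_init, the put_pixels
functions) are hypothetical and not FFmpeg's API; the "RVV" function is a C
stand-in for a routine that would normally live in a separate assembly file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU feature flag, for illustration only. */
#define SKETCH_CPU_FLAG_RVV (1u << 0)

typedef void (*put_pixels_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Hypothetical DSP context holding the selected implementation. */
typedef struct SketchDSPContext {
    put_pixels_fn put_pixels;
} SketchDSPContext;

/* Portable C fallback: safe on every CPU. */
static void put_pixels_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Stand-in for a vector routine kept outside the C translation unit,
 * so that no vector instruction can leak into generic code paths. */
static void put_pixels_rvv_standin(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Pick implementations once, based on flags detected at run time. */
static void sketch_dsp_init(SketchDSPContext *c, unsigned cpu_flags)
{
    c->put_pixels = put_pixels_c;
    if (cpu_flags & SKETCH_CPU_FLAG_RVV)
        c->put_pixels = put_pixels_rvv_standin;
}
```

With outline assembler, only the routine behind the pointer uses vector
instructions, so a single binary can still run on a processor without a
vector unit. Intrinsic code, as I understand the objection, requires
building the whole C file with the vector extension enabled, which loses
that guarantee.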

Now I do realise that that means some of the code won't be XLEN-independent. 
Well, we can cross that bridge with macros if/when somebody actually cares 
about FFmpeg vector optimisations on RV32I.

Br,

-- 
雷米‧德尼-库尔蒙
http://www.remlab.net/





Re: [FFmpeg-devel] [PATCH 0/5] RISC-V: Improve H264 decoding performance using RVV intrinsic

2023-05-10 Thread Arnie Chang
Hi Lynne

I fully respect the policy and understand the disadvantages of intrinsic
code.
Considering the benefits of an open ISA like RISC-V, intrinsic code should
still have a better chance of being optimized by the compiler for hardware
variants.

At this moment, the intrinsic implementation is the only thing available.
It would take a significant amount of time to rewrite it in assembly due to
the large number of functions.

I was wondering if we could treat the intrinsic code as an initial version
for the RISC-V port, with the following modification:
- Add an option --enable-rvv-intrinsic to EXPLICITLY enable the
  intrinsic optimization, which is disabled by default.
  Given the current state of vector support in GCC, and the dislike and
  limitations of intrinsics, disabling it by default seems reasonable.

Those who want to be involved in optimizing the H.264 decoder on RISC-V
can work on the assembly and decide whether to refer to the intrinsic code.
I believe this would be a good starting point for future optimization.




Re: [FFmpeg-devel] [PATCH 0/5] RISC-V: Improve H264 decoding performance using RVV intrinsic

2023-05-10 Thread Lynne
May 10, 2023, 10:47 by arnie.ch...@sifive.com:

> Hi Lynne
>
> I fully respect the policy and understand the disadvantages of intrinsic
> code.
> Considering the benefits of the open ISA like RISC-V,
> the intrinsic code should still have a better chance of being optimized by
> the compiler for hardware variants.
>

Whether the ISA is open or not is irrelevant. Power9 is open, and yet
compilers still fail to deliver consistent performance and instead thrash
vectors to the stack.
Optimizing assembly code for new ISA features is simple with the
much more advanced templating system present in assemblers.
Plus, we can confirm that it's a net gain rather than a compiler artifact.

As advanced as compilers are, we cannot even trust them to compile
C code correctly. GCC still has issues and miscompiles/misvectorizes
our code, so we have to disable tree vectorization. Not that it's a big
issue: performance-sensitive code is all assembly for us.


> At this moment, the intrinsic implementation is the only thing available.
> It would take a significant amount of time to rewrite it in assembly due to
> the large number of functions.
>

It's precisely because there isn't a lot of code written yet that this ought
to be done now. Rewriting intrinsics or inline assembly after they have been
merged is a hard process, and all sorts of bugs and weird behavior appear
when rewriting to assembly.
You could start by just disassembling the compiled version and cleaning
it up. We've had to do this in the past.


> I was wondering if we could treat the intrinsic code as an initial version
> for the RISC-V port, with the following modification:
>  - Add an option --enable-rvv-intrinsic to EXPLICITLY enable the
>    intrinsic optimization, which is disabled by default.
>    Given the current state of vector support in GCC, and the dislike and
>    limitations of intrinsics, disabling it by default seems reasonable.
>
> Those who want to be involved in optimizing the H.264 decoder on RISC-V
> can work on the assembly and decide whether to refer to the intrinsic code.
> I believe this would be a good starting point for future optimization.
>

Well, sort of, no. No CPU has support for RVV 1.0 at the moment.
There's no reason to hurry with this at all and merge less-than-desirable
code, disabled by default, which hasn't even been tested on actual hardware.

There's hardly any real hardware on the horizon either. The P670 was
allegedly released last year, but even you had to test your code on an FPGA.
Even then, the P670 only has 128-bit ALUs, which is suboptimal as
variable vector code tends to be more latency-bound.
The XuanTie C908 is a better candidate that I heard is getting released
sooner, and it has 256-bit ALUs.

I've been wanting to write RVV code for years now, but the hardware
simply hasn't been there yet.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 0/5] RISC-V: Improve H264 decoding performance using RVV intrinsic

2023-05-10 Thread Rémi Denis-Courmont
Hi,

On 10 May 2023 11:46:57 GMT+03:00, Arnie Chang wrote:
>Considering the benefits of the open ISA like RISC-V,
>the intrinsic code should still have a better chance of being optimized by
>the compiler for hardware variants.

You probably have access to proprietary performance information of SiFive which 
nobody else here can argue about, so maybe you are onto something here.

However, FFmpeg needs to support any RV64GC CPU with a single build, because 
that's how many Linux distributions and applications will build it. So we can't 
really rely on the compiler's per-CPU model tuning for scheduling. In any case, 
my guess is that there won't be that much room for the compiler to reorder 
vector code, even if it's using intrinsics.

On the contrary, I fear that we need to tune the group multiplier (LMUL) at 
runtime to get good performance on different processor designs. That is 
essentially unrolling. And if that turns out to be true, then we *cannot* use 
intrinsics, since, unlike outline assembler, they don't support varying the 
group multiplier at runtime.
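
To make the "essentially unrolling" remark concrete, here is a small
worked sketch of how LMUL changes the amount of work done per loop
iteration. It relies on the RVV relation VLMAX = LMUL * VLEN / SEW and is
restricted to integer LMUL values. The function names (vlmax,
strip_iterations) are hypothetical helpers, not part of any API.

```c
#include <stddef.h>

/* Maximum number of elements in one vector register group:
 * VLMAX = LMUL * VLEN / SEW (integer LMUL of 1, 2, 4 or 8 only). */
static size_t vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul)
{
    return (size_t)lmul * vlen_bits / sew_bits;
}

/* Number of strip-mining loop iterations needed to process n elements
 * when each iteration handles up to VLMAX elements. */
static size_t strip_iterations(size_t n, unsigned vlen_bits,
                               unsigned sew_bits, unsigned lmul)
{
    size_t vl = vlmax(vlen_bits, sew_bits, lmul);
    return (n + vl - 1) / vl;   /* round up for the final partial strip */
}
```

For example, with VLEN = 128 and 8-bit elements, 1280 samples take 80
iterations at LMUL = 1 but 20 at LMUL = 4. In assembler the vector type can
be supplied in a register (the vsetvl instruction), so LMUL can be chosen
per processor at run time. With intrinsics, LMUL is part of the type and
function names (for instance vuint8m1_t versus vuint8m4_t), so it is fixed
when the code is compiled.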

So I could be completely wrong, but if so, we'd need a more substantial 
explanation and justification why.

>At this moment, the intrinsic implementation is the only thing available.
>It would take a significant amount of time to rewrite it in assembly due to
>the large number of functions.

Is it really that much work? Leaving aside maybe converting the inline 
functions into assembler macros, it seems mostly like a case of passing the C 
code through the compiler, then disassembling the result and then reformatting 
for legibility here and there.

As the proverb goes, "on the Internet, nobody knows you're a monkey". Nobody 
needs to know that somebody wrote their assembler with the help of intrinsics 
and a compiler.

>I was wondering if we could treat the intrinsic code as an initial version
>for the RISC-V port with the following modification.
>- Add an option --enable-rvv-intrinsic to EXPLICITLY enable the
>intrinsic optimization, which is disabled by default.

I will let more senior developers comment here, but I suspect that this 
would set a bad example that would eventually lead other people to choose 
intrinsics over outline assembler for new code.

Adding a build option could be viable if we wanted to advise against using the 
code. But here we rather want to advise against using the code as a reference, 
not against running it.

If this were the kernel, I'd argue merging the code into `staging` but FFmpeg 
is not so large that it'd have a staging area.

>  Given the current state of vector support in GCC, and the dislike and
>limitations of intrinsics, disabling it by default seems reasonable.
>
>Those who want to be involved in optimizing the H.264 decoder on RISC-V
>can work on the assembly and decide whether to refer to the intrinsic code.
>I believe this would be a good starting point for future optimization.

Well, most likely. The thing is, though, that nobody in the FFmpeg community 
(except you) has hardware access in any shape or form at this time, at least 
as far as I know. That's one of the reasons why my own efforts have stalled.
