Re: [FFmpeg-devel] RISC-V vector DSP functions: Motivation for commit 446b009

2024-07-06 Thread Rémi Denis-Courmont
(Updating an old thread)

On Friday, 19 January 2024 at 19:14:02 EEST, Rémi Denis-Courmont wrote:
> Hi,
> 
> On Friday, 19 January 2024 at 17:30:00 EET, Michael Platzer via ffmpeg-devel wrote:
> > Commit 446b0090cbb66ee614dcf6ca79c78dc8eb7f0e37 by Remi Denis-Courmont has
> > replaced RISC-V vector loads and stores with negative stride with vrgather
> > (generalized permutation within vector registers) instructions in order to
> > reverse the elements in a vector register. The commit message explains
> > that
> > this change was done, but it does not explain why.
> 
> It was faster on what the best approximation of real hardware available at
> the time, i.e. a Sipeed Lichee Pi4A board. There are no benchmarks in the
> commit because I don't like to publish benchmarks collected from
> prototypes.

FWIW, it still seems to hold true on the SpacemiT X60 (Banana Pi BPI-F3).

On that hardware, strided loads/stores scale linearly with the number of 
elements, as you'd expect, with no optimisation for a stride of minus one (or, 
more accurately, minus the element byte size). Vector gathers scale 
quadratically.

So for sufficiently large vectors, strided loads/stores would presumably be 
faster. But on real hardware with relatively small vector sizes, the highest 
bandwidth is achieved by unit-strided loads/stores plus a vector gather at 
LMUL=1.
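
For reference, here is a rough C sketch of the two element-reversal strategies 
being compared (my own illustration using the RVV 1.0 intrinsics, not the 
actual FFmpeg assembler): a load with a stride of minus one element versus a 
unit-stride load followed by a gather at LMUL=1.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch, not FFmpeg code. */

/* Strategy A: strided load walking backwards through the source. */
static void reverse_vlse(float *dst, const float *src, size_t n)
{
    const float *s = src + n;              /* one past the last element */
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);
        /* Read vl elements starting at the last one, stepping back 4 bytes. */
        vfloat32m1_t v = __riscv_vlse32_v_f32m1(s - 1, -(ptrdiff_t)sizeof(float), vl);
        __riscv_vse32_v_f32m1(dst, v, vl);
        s -= vl; dst += vl; n -= vl;
    }
}

/* Strategy B: unit-stride load, then vrgather with a descending index
 * vector built from vid.v and vrsub.vx, kept at LMUL=1. */
static void reverse_vrgather(float *dst, const float *src, size_t n)
{
    const float *s = src + n;
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);
        s -= vl;
        vfloat32m1_t v = __riscv_vle32_v_f32m1(s, vl);
        vuint32m1_t idx = __riscv_vid_v_u32m1(vl);                 /* 0 .. vl-1 */
        idx = __riscv_vrsub_vx_u32m1(idx, (uint32_t)(vl - 1), vl); /* vl-1 .. 0 */
        __riscv_vse32_v_f32m1(dst, __riscv_vrgather_vv_f32m1(v, idx, vl), vl);
        dst += vl; n -= vl;
    }
}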

-- 
雷米‧德尼-库尔蒙
http://www.remlab.net/





Re: [FFmpeg-devel] RISC-V vector DSP functions: Motivation for commit 446b009

2024-01-23 Thread Rémi Denis-Courmont
On Tuesday, 23 January 2024 at 19:34:46 EET, Michael Platzer via ffmpeg-devel wrote:
> I agree that the indexed and strided loads and stores are certainly slower
> than unit-strided loads and stores. However, the vrgather instruction is
> unlikely to be very performant either, unless the vector length is
> relatively short.

> Particularly, if vector register groups are used via a
> length multiplier LMUL of, e.g., 8, then any element in the destination
> vector register could be sourced from any element in the 8 source vector
> registers (i.e., 1/4 of the vector register file).

Gather instructions seem to scale quadratically on existing hardware, which is 
bad. That's why the FFmpeg code was later modified to use LMUL=1 in that 
particular case.

Now if you want to argue that VLSE is better, then please provide a patch 
exhibiting better performance on FFmpeg's checkasm on real hardware.
Otherwise, this discussion is not much more than he-said-she-said.

> By contrast, the performance of strided loads and stores, while certainly
> slower than unit-strided loads and stores, likely scales linearly with the
> vector length, so on CPUs with large VLEN the original code could very well
> run faster than the variant with vrgather, despite the slower strided loads
> and stores.

Yes, but it's a stretch to expect that accessing memory will be faster than 
accessing registers, especially when the dataset is typically too large to fit 
in L1. Furthermore, strided loads require adders to compute the accessed 
addresses - something VRGATHER (or even VLUXEI) does not need.

Some people wish that processor cores would special-case strides of minus 
EEW/8 bytes. And sure, that would be nice. But so far, that's just wishful 
thinking.

> > > The RISC-V vector loads and stores support negative stride values for 
> > > use cases such as this one.
> > 
> > [Citation required]
>
> The purpose of strided loads and stores is to load/store elements that are
> not consecutive in memory, but instead separated by a constant offset.
> Additionally, the authors of the specification decided to allow negative
> stride values, since they apparently deemed it useful to be able to reverse
> the order of those elements.

FFmpeg *still* uses strided loads and stores where applicable, typically where 
the stride is legitimately variable. However, I cannot find any justification 
that small constant non-unit strides would be a good idea.
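
As an illustration of a legitimately variable stride (a sketch of mine with 
the RVV C intrinsics, not a quote from the FFmpeg tree): copying one byte 
column of a plane whose line size is only known at run time.

#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

/* Copy `height` bytes from one column of a plane with a run-time line
 * size into a contiguous buffer. Illustrative sketch only. */
static void copy_column(uint8_t *dst, const uint8_t *src,
                        ptrdiff_t linesize, size_t height)
{
    while (height > 0) {
        size_t vl = __riscv_vsetvl_e8m1(height);
        vuint8m1_t v = __riscv_vlse8_v_u8m1(src, linesize, vl);
        __riscv_vse8_v_u8m1(dst, v, vl);
        src += (ptrdiff_t)vl * linesize;
        dst += vl;
        height -= vl;
    }
}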
 
Just because negative strides are allowed does not mean that the negative-unit 
case will be optimised. Again, I have only seen some wishful thinking from 
some developers here and there. I have yet to see a serious quote from an IP 
vendor or a benchmark that would support this.

> > > Using vrgather instead replaces the more specific operation with a 
> > > more generic one,
> > 
> > 
> > That is a very subjective and unsubstantiated assertion. This feels a bit
> > hypocritical while you are attacking me for not providing justification.
> 
> vrgather is more generic because it can be used for any kind of permutation,
> which strided loads and stores cannot. This is not subjective.

That would be a fair comparison of vrgather with hypothetical vreverse or 
vtranspose instructions. But you're comparing apples and oranges here.
 
> > As far as I can tell, neither instruction is specific to reversing vector
> > element order. An actual real-life specific instruction exists on Arm in
> > the form of vector-reverse. I don't know any ISA with load-reverse or
> > store-reverse.
> 
> A load-reverse or store-reverse would just be a special case of strided
> load/store.

By that logic, a unit-stride load is just a special case of a strided load, 
and a strided load is just a special case of an indexed load. From an 
architectural (functional) standpoint, that is indeed true. From a hardware 
design and microbenchmarking standpoint, however, it is certainly false.
 
> When writing about the performance of vrgather I primarily had the
> scalability issues explained above in mind. It seems that you have already
> experienced these, since you found that a larger LMUL reduces the
> performance of vrgather.

> How would the reverse subtraction be optimized away? I assume that it needs
> to be part of the loop since it depends on the VL of the current iteration.

VRSUB computes the same vector in all but the last two iterations. All you 
need to do is special-case the tail iterations. Then VRSUB can be run just 
twice for the whole function, and zero times per main-loop iteration.
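
In intrinsics terms (again my own sketch, not the actual assembler), the 
hoisting looks roughly like this; the index vector only depends on vl, and vl 
only changes on the final strip-mining iterations:

#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: reverse n floats, recomputing the descending
 * index vector only when vl changes in the tail. */
static void reverse_hoisted(float *dst, const float *src, size_t n)
{
    const float *s = src + n;
    size_t vl = __riscv_vsetvl_e32m1(n);
    /* Descending indices vl-1 .. 0; computed once for the main loop. */
    vuint32m1_t idx = __riscv_vrsub_vx_u32m1(__riscv_vid_v_u32m1(vl),
                                             (uint32_t)(vl - 1), vl);
    while (n > 0) {
        size_t tail_vl = __riscv_vsetvl_e32m1(n);
        if (tail_vl != vl) {                  /* only hit in the tail */
            vl = tail_vl;
            idx = __riscv_vrsub_vx_u32m1(__riscv_vid_v_u32m1(vl),
                                         (uint32_t)(vl - 1), vl);
        }
        s -= vl;
        vfloat32m1_t v = __riscv_vle32_v_f32m1(s, vl);
        __riscv_vse32_v_f32m1(dst, __riscv_vrgather_vv_f32m1(v, idx, vl), vl);
        dst += vl; n -= vl;
    }
}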

-- 
雷米‧德尼-库尔蒙
http://www.remlab.net/





Re: [FFmpeg-devel] RISC-V vector DSP functions: Motivation for commit 446b009

2024-01-23 Thread Michael Platzer via ffmpeg-devel
Hi Rémi,

Thanks for your reply.

> It was faster on what the best approximation of real hardware available at 
> the time, i.e. a Sipeed Lichee Pi4A board. There are no benchmarks in the 
> commit because I don't like to publish benchmarks collected from prototypes.
> Nevertheless I think the commit message hints enough that anybody could 
> easily guess that it was a performance optimisation, if I'm being honest.
> 
> This is not exactly surprising: typical hardware can only access so many 
> memory addresses simultaneously (i.e. one or maybe two), so indexed loads and 
> strided loads are bound to be much slower than unit-strided loads.

I agree that the indexed and strided loads and stores are certainly slower than 
unit-strided loads and stores. However, the vrgather instruction is unlikely to 
be very performant either, unless the vector length is relatively short. 
Particularly, if vector register groups are used via a length multiplier LMUL 
of, e.g., 8, then any element in the destination vector register could be 
sourced from any element in the 8 source vector registers (i.e., 1/4 of the 
vector register file).

AFAIK (but please correct me if I am wrong) the Sipeed Lichee Pi4A uses a 
quad-core XT-910, which depending on the exact variant has a vector register 
length (VLEN) of either 64 or 128 bits, so given the configured element width 
of 32 bits and length multiplier of 2, we are looking at vectors of 4 or 8 
elements.
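(That element count is just the usual VLMAX formula: VLMAX = VLEN * LMUL / SEW, 
i.e. 64 * 2 / 32 = 4 or 128 * 2 / 32 = 8.)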

There is a comment that reads "e16/m2 and e32/m4 are possible but slower due to 
gather", which does not surprise me, since the performance of vrgather most 
likely scales quadratically with the vector length. Similarly, vrgather is 
likely less performant on a RISC-V CPU with larger VLEN, since the hardware 
cost of a crossbar capable of permuting the full vector register length becomes 
prohibitive for VLEN beyond 128 bits. This forces the permutation to be spread 
over several iterations instead, which need to cover every combination of input 
and output elements (hence the quadratic growth in execution time).
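
A back-of-the-envelope model of that growth, assuming the execution unit 
processes one VLEN-bit slice of the register group per pass and has no 
full-width crossbar (an assumption about typical implementations, not a claim 
about any specific core):

  slices per register group:  G = LMUL  (one VLEN-bit register each)
  vrgather passes:            ~ G * G   (every destination slice may need data
                                          from every source slice)
  strided load/store passes:  ~ G       (one pass per slice)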

By contrast, the performance of strided loads and stores, while certainly 
slower than unit-strided loads and stores, likely scales linearly with the 
vector length, so on CPUs with large VLEN the original code could very well run 
faster than the variant with vrgather, despite the slower strided loads and 
stores.

> Maybe you have access to special hardware that is able to optimise the 
> special case of strides equal to minus one to reduce the number of memory 
> accesses.
> But I didn't back then, and as a matter of fact, I still don't. Hardware 
> donations are welcome.

Hardware availability is indeed still an issue for RISC-V vector processing.

> > The RISC-V vector loads and stores support negative stride values for 
> > use cases such as this one.
> 
> [Citation required]

The purpose of strided loads and stores is to load/store elements that are not 
consecutive in memory, but instead separated by a constant offset. 
Additionally, the authors of the specification decided to allow negative stride 
values, since they apparently deemed it useful to be able to reverse the order 
of those elements.

> > Using vrgather instead replaces the more specific operation with a 
> > more generic one,
> 
> That is a very subjective and unsubstantiated assertion. This feels a bit 
> hypocritical while you are attacking me for not providing justification.

vrgather is more generic because it can be used for any kind of permutation, 
which strided loads and stores cannot. This is not subjective.

> As far as I can tell, neither instruction is specific to reversing vector 
> element order. An actual real-life specific instruction exists on Arm in the 
> form of vector-reverse. I don't know any ISA with load-reverse or 
> store-reverse.

A load-reverse or store-reverse would just be a special case of strided 
load/store.

> > which is likely to be less performant on most HW architectures.
> 
> Would you care to define "most architectures"? I only know one commercially 
> available hardware architecture as of today, Kendryte K230 SoC with T-Head
> C908 CPU, so I can't make much sense of your sentence here.

When writing about the performance of vrgather I primarily had the scalability 
issues explained above in mind. It seems that you have already experienced 
these, since you found that a larger LMUL reduces the performance of vrgather.

> > In addition, it requires setting up an index vector,
> 
> That is irrelevant since in this loop, the vector bank is not a bottleneck.
> The loop can run with maximal LMUL either way. And besides, the loop turned 
> out to be faster with a smaller multiplier.

That is because the performance of vrgather does not scale linearly. I would 
assume that this does not happen with the original code (i.e., the performance 
of strided loads/stores does not decrease for larger LMUL).

> > thus raising dynamic instruction count.

Re: [FFmpeg-devel] RISC-V vector DSP functions: Motivation for commit 446b009

2024-01-19 Thread Rémi Denis-Courmont
Hi,

On Friday, 19 January 2024 at 17:30:00 EET, Michael Platzer via ffmpeg-devel wrote:
> Commit 446b0090cbb66ee614dcf6ca79c78dc8eb7f0e37 by Remi Denis-Courmont has
> replaced RISC-V vector loads and stores with negative stride with vrgather
> (generalized permutation within vector registers) instructions in order to
> reverse the elements in a vector register. The commit message explains that
> this change was done, but it does not explain why.

It was faster on what the best approximation of real hardware available at the 
time, i.e. a Sipeed Lichee Pi4A board. There are no benchmarks in the commit 
because I don't like to publish benchmarks collected from prototypes. 
Nevertheless I think the commit message hints enough that anybody could easily 
guess that it was a performance optimisation, if I'm being honest.

This is not exactly surprising: typical hardware can only access so many 
memory addresses simultaneously (i.e. one or maybe two), so indexed loads and 
strided loads are bound to be much slower than unit-strided loads.

Maybe you have access to special hardware that is able to optimise the special 
case of strides equal to minus one to reduce the number of memory accesses. 
But I didn't back then, and as a matter of fact, I still don't. Hardware 
donations are welcome.

> I fail to see what could possibly have motivated this change.

> The RISC-V vector loads and stores support negative stride values for use
> cases such as this one.

[Citation required]

> Using vrgather instead replaces the more specific operation with a more
> generic one,

That is a very subjective and unsubstantiated assertion. This feels a bit 
hypocritical while you are attacking me for not providing justification.

As far as I can tell, neither instruction is specific to reversing vector 
element order. An actual real-life specific instruction exists on Arm in the 
form of vector-reverse. I don't know any ISA with load-reverse or 
store-reverse.

> which is likely to be less performant on most HW architectures.

Would you care to define "most architectures"? I only know one commercially 
available hardware architecture as of today, Kendryte K230 SoC with T-Head 
C908 CPU, so I can't make much sense of your sentence here.

> In addition, it requires setting up an index vector,

That is irrelevant since in this loop, the vector bank is not a bottleneck. 
The loop can run with maximal LMUL either way. And besides, the loop turned 
out to be faster with a smaller multiplier.

> thus raising dynamic instruction count.

It adds only one instruction (reverse subtraction) in the main loop, and even 
that could be optimised away if relevant.

-- 
レミ・デニ-クールモン
http://www.remlab.net/





[FFmpeg-devel] RISC-V vector DSP functions: Motivation for commit 446b009

2024-01-19 Thread Michael Platzer via ffmpeg-devel
Hi,

Commit 446b0090cbb66ee614dcf6ca79c78dc8eb7f0e37 by Remi Denis-Courmont has 
replaced RISC-V vector loads and stores with negative stride with vrgather 
(generalized permutation within vector registers) instructions in order to 
reverse the elements in a vector register. The commit message explains that 
this change was done, but it does not explain why.

I fail to see what could possibly have motivated this change. The RISC-V vector 
loads and stores support negative stride values for use cases such as this one. 
Using vrgather instead replaces the more specific operation with a more generic 
one, which is likely to be less performant on most HW architectures. In 
addition, it requires setting up an index vector, thus raising dynamic 
instruction count.

Could someone familiar with this change (perhaps Remi himself) please explain 
the motivation for this change?

Thanks,
Michael