Apr 19, 2021, 22:27 by d...@lynne.ee:

> This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx. 
> The design of this pure assembly FFT is pretty unconventional.
>

Oh, I forgot to mention _why_ on the majority of transforms it's slower than 
FFTW.
It's simple - we don't hardcode anything. In fact, the biggest bottleneck by 
far is
the lookup done during loading of input data. Easily more than 40% of the entire
time is spent just doing lookups.
FFTW hardcodes the addresses in the transforms themselves (they call them
codelets). And in the process duplicate herculean amounts of code, amplified by
their just-as-inefficient Split-Radix refactoring which needs 4 versions of 
each codelet.
The 5+MB stripped binary can't get fat by itself, after all.
Also FFTW only supports a single extension (e.g. fma, avx, avx2) at once during
compile time, so that size figure looks even worse.

For an in-place pre-permuted transform where all the lookups are gone and 
replaced
with simple loads, we're actually consistently faster than FFTW. Thus, once
non-power-of-two transforms are implemented (where this situation happens),
we'll be faster than FFTW by quite a margin. The old non-power-of-two FFT
code is faster than FFTW by a bit, and this code's C version is quite a bit 
faster
than that code's C version.
I'd have implemented this as well before I sent the patch, but there are only 
so many
hours in a day, and ways to ignore the outside world, and the 5.0 release 
demands
my urgent attention, so I'll write the non-power-of-two part up when I feel 
like it in
hopefully a week or two.

An easy way to gauge optimization is the perf graph:
 39.58%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.32pt
 21.17%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.64pt
   8.21%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_deinterleave
   3.90%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_32768
   3.15%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_8192
   3.08%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_16384
   3.00%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_65536
   2.97%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_512
   2.96%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_4096
   2.96%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_2048
   2.95%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.256pt
   2.91%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.128pt
   2.89%  a.out    a.out              [.] 
ff_split_radix_fft_float_fma3.synth_1024
   0.04%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.512pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.1024pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.4096pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.2048pt
   0.00%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.131072pt
   0.00%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.8192pt

Only the 32-point and 64-point transforms read data from the input,
and that's where 60% of the overhead is.

A fast vgatherdpd will go a long way to speed this up, as well as
AVX512. This code should be quite a bit faster on Ice Lake and Tiger Lake
systems, where vgatherdpd was significantly improved over Skylake.
Maybe someone can run checkasm on one of those systems.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Reply via email to