On 7/12/2021 7:46 AM, Lynne wrote:
12 Jul 2021, 11:29 by alankelly-at-google....@ffmpeg.org:

On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alanke...@google.com> wrote:

On Fri, Jun 25, 2021 at 10:40 AM Lynne <d...@lynne.ee> wrote:

Jun 25, 2021, 09:54 by alankelly-at-google....@ffmpeg.org:

Broadwell and later and Zen3 and later have fast gather instructions.
---
  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
  and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
  libavutil/cpu.h     |  2 ++
  libavutil/x86/cpu.c | 18 ++++++++++++++++--
  libavutil/x86/cpu.h |  1 +
  3 files changed, 19 insertions(+), 2 deletions(-)


No, we really don't need more FAST/SLOW flags, especially for
something like this which is just fixable by _not_using_vgather_.
Take a look at libavutil/x86/tx_float.asm, we only use vgather
if it's guaranteed to either be faster for what we're gathering or
to be just as fast as the "slow" path. If neither is true, we use manual lookups,
which is actually advantageous since for AVX2 we can interleave
the lookups that happen in each lane.

Even if we disregard this, I've extensively benchmarked vgather
on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
a great vgather improvement to be found in Zen 3 to justify
using a new CPU flag for this.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Thanks for your response. I'm not against finding a cleaner way of
enabling/disabling the code which will be protected by this flag. However,
the manual lookups solution proposed will not work in this case: the avx2
version of hscale will only be faster if fast gathers are available,
otherwise, the ssse3 version should be used.

I haven't got access to a Zen 3, so I can't comment on its performance. I
have tested on a Zen 2 and it is slow. On Broadwell, hscale avx2 is about
10% faster than the ssse3 version and on Skylake about 40% faster; Haswell
has similar performance to Zen 2.

Is there a proxy which could be used for detecting Broadwell or Skylake
and later? AVX512 seems too strict as there are Skylake chips without
AVX512. Thanks
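
For illustration only, and not part of the patch: one possible proxy is the
family/model signature reported in EAX by CPUID leaf 1. The sketch below
decodes a raw EAX value and compares it against the Haswell model numbers
(0x3C, 0x3F, 0x45, 0x46). The struct, the function names and the model list
are assumptions made for this example, and the detection code in
libavutil/x86/cpu.c may be organised differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type for this sketch, not an FFmpeg structure. */
struct cpu_sig {
    unsigned family;
    unsigned model;
};

/* Decode the display family/model from CPUID leaf 1 EAX. The extended
 * model bits apply to base families 6 and 15; the extended family bits
 * apply to base family 15 only. */
void decode_cpu_sig(uint32_t eax, struct cpu_sig *out)
{
    assert(out != NULL);

    unsigned family = (eax >> 8) & 0xf;
    unsigned model  = (eax >> 4) & 0xf;

    if (family == 6 || family == 15)
        model |= ((eax >> 16) & 0xf) << 4;
    if (family == 15)
        family += (eax >> 20) & 0xff;

    out->family = family;
    out->model  = model;
}

/* Returns true for the Haswell model numbers assumed above. The caller
 * is expected to have already checked that the vendor is Intel. */
bool is_haswell_model(const struct cpu_sig *sig)
{
    if (sig->family != 6)
        return false;

    switch (sig->model) {
    case 0x3C: case 0x3F: case 0x45: case 0x46:
        return true;
    default:
        return false;
    }
}
```

With a check like this, the avx2 hscale could be skipped on Haswell while
every other AVX2 function stays enabled. Whether a model list is acceptable
for libavutil is a separate question that this sketch does not settle.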


Hi,

I will paste the performance figures from the thread for the other part of
this patch here so that the justification for this flag is clearer:

                                Skylake   Haswell
hscale_8_to_15_width4_ssse3       761.2     760
hscale_8_to_15_width4_avx2        468.7     957
hscale_8_to_15_width8_ssse3      1170.7    1032
hscale_8_to_15_width8_avx2        865.7    1979
hscale_8_to_15_width12_ssse3     2172.2    2472
hscale_8_to_15_width12_avx2      1245.7    2901
hscale_8_to_15_width16_ssse3     2244.2    2400
hscale_8_to_15_width16_avx2      1647.2    3681

As you can see, it is catastrophic on Haswell and older chips, but the gains
on Skylake are impressive.
As I don't have performance figures for Zen 3, I can disable this feature
on all CPUs apart from Broadwell and later, since you say that there is no
worthwhile improvement on Zen 3. Is this OK with you?
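
To make the comparison easier to read, the figures above can be turned into
ssse3/avx2 ratios. The following is a minimal sketch that only restates the
numbers quoted in this thread; the struct and function names are
hypothetical, and it assumes that lower figures are better, as the discussion
implies.

```c
#include <assert.h>

/* One row per filter width, figures copied from the table in this thread. */
struct hscale_row {
    int    width;
    double skylake_ssse3, skylake_avx2;
    double haswell_ssse3, haswell_avx2;
};

const struct hscale_row hscale_rows[] = {
    {  4,  761.2,  468.7,  760.0,  957.0 },
    {  8, 1170.7,  865.7, 1032.0, 1979.0 },
    { 12, 2172.2, 1245.7, 2472.0, 2901.0 },
    { 16, 2244.2, 1647.2, 2400.0, 3681.0 },
};

/* Ratio of the ssse3 figure to the avx2 figure: above 1.0 means the avx2
 * version is faster, below 1.0 means it is slower. */
double avx2_speedup(double ssse3, double avx2)
{
    assert(ssse3 > 0.0 && avx2 > 0.0);
    return ssse3 / avx2;
}
```

For every width in the table the ratio is above 1.0 on Skylake and below 1.0
on Haswell.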


It's not that catastrophic. Since Haswell CPUs generally don't have
large AVX2 gains, could you just exclude Haswell only from
EXTERNAL_AVX2_FAST, and require EXTERNAL_AVX2_FAST
to enable those functions?

And disable all non-gather AVX2 asm functions on Haswell? No. And it's a lie that Haswell doesn't have large gains with AVX2.
