aarch64: add hscale specializations

Martin Storsjö Wed, 20 Apr 2022 01:44:47 -0700

On Sun, 17 Apr 2022, Martin Storsjö wrote:

On Fri, 15 Apr 2022, Swinney, Jonathan wrote:
This patch adds specializations for hscale for filterSize == 4 and 8 and
converts the existing implementation for the X8 version. For the old code, now used for the X8 version, it improves the efficiency of the final summations by
reducing 11 instructions to 7.
ff_hscale8to15_8_neon is mostly unchanged from the original except for a few
changes.
- The loads for the filter data were consolidated into a single 64 byte ld1
  instruction.
Couldn't you do this optimization on the existing function too?

Sorry, now I realized why this optimization can only be done if you operate on a specific known filter width.

- The final summations were improved.
- The inner loop on filterSize was completely removed
I presume that this is the only differing factor which affects whether it's worthwhile to keep a separate width=8 function or not. At least from the checkasm benchmark numbers, the difference is notable but not huge (in the range of 4-10%, while the summation improvements gain even more).
Given a fully optimized function that has an inner loop (which is only taken once for the width=8 case), is the separate function without an inner loop really necessary?

With the ideal version of the final summation in both functions, the separate filtersize=8 function is 11-19% faster than the generic multiple-of-8 function (on Cortex A53 and A72; on A73, both versions are essentially equally fast), so there's probably good reason to go with the separate version.
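
To illustrate the structural difference being discussed, here is a rough scalar C model of the two variants. It is only a sketch and not the NEON code from the patch: the names hscale_generic_x8, hscale_fixed_8 and clip15 are made up for this example, and the arithmetic (shift right by 7, clamp to 15 bits) is assumed to follow the usual 8-bit-to-15-bit C reference behaviour. The generic version keeps an inner loop over filterSize in steps of 8, while the fixed version knows there are exactly 8 taps, so the loop disappears and the coefficients for consecutive outputs sit at a fixed stride.

```c
#include <stdint.h>

/* Hypothetical helper: scale the 8 bit x 14 bit sum down to 15 bits
 * and clamp. Assumes arithmetic right shift for negative sums. */
static int16_t clip15(int val)
{
    val >>= 7;
    return (int16_t)(val < 32767 ? val : 32767);
}

/* Generic variant: filterSize is any multiple of 8, so an inner loop
 * over blocks of 8 taps is needed for every output pixel. */
void hscale_generic_x8(int16_t *dst, int dstW, const uint8_t *src,
                       const int16_t *filter, const int32_t *filterPos,
                       int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + filterSize * i;
        int val = 0;
        for (int j = 0; j < filterSize; j += 8)   /* taken once when filterSize == 8 */
            for (int k = 0; k < 8; k++)
                val += s[j + k] * f[j + k];
        dst[i] = clip15(val);
    }
}

/* Specialized variant: exactly 8 taps, no loop over filterSize.
 * The coefficients for output i always start at filter + 8 * i. */
void hscale_fixed_8(int16_t *dst, int dstW, const uint8_t *src,
                    const int16_t *filter, const int32_t *filterPos)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + 8 * i;
        int val = s[0] * f[0] + s[1] * f[1] + s[2] * f[2] + s[3] * f[3]
                + s[4] * f[4] + s[5] * f[5] + s[6] * f[6] + s[7] * f[7];
        dst[i] = clip15(val);
    }
}
```

With filterSize == 8, both functions should produce identical output; the difference is purely in loop overhead and in how far the loads can be arranged ahead of time.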


Thus, disregard the review comments above.

// Martin

Re: [FFmpeg-devel] [PATCH 1/2] swscale/aarch64: add hscale specializations