On Sun, 17 Apr 2022, Martin Storsjö wrote:

On Fri, 15 Apr 2022, Swinney, Jonathan wrote:

This patch adds specializations for hscale for filterSize == 4 and 8 and
converts the existing implementation for the X8 version. For the old code,
now used for the X8 version, it improves the efficiency of the final
summations by reducing 11 instructions to 7.

ff_hscale8to15_8_neon is mostly unchanged from the original, apart from the
following changes:
- The loads for the filter data were consolidated into a single 64 byte ld1
  instruction.

Couldn't you do this optimization on the existing function too?

Sorry, now I realize why this optimization can only be done if you operate on a specific, known filter width.
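
To illustrate why the known width matters, here is a small C sketch of the address arithmetic. It assumes the filter coefficients are stored as int16_t in row-major order (filterSize coefficients per output pixel) and that the loop handles four output pixels per iteration, which is what a 64 byte load of 8-tap filters would cover. The helper names are made up for this example and are not part of libswscale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of coefficient j of output pixel i,
 * assuming a row-major int16_t filter table (an assumption, see above). */
static size_t coeff_offset(int filterSize, int i, int j)
{
    return ((size_t)filterSize * i + j) * sizeof(int16_t);
}

/* Returns 1 if the first 8 coefficients of `count` consecutive output
 * pixels, starting at pixel i, form one contiguous run of bytes. */
static int first8_contiguous(int filterSize, int i, int count)
{
    for (int k = 1; k < count; k++) {
        size_t prev_end   = coeff_offset(filterSize, i + k - 1, 8);
        size_t next_start = coeff_offset(filterSize, i + k, 0);
        if (prev_end != next_start)
            return 0;
    }
    return 1;
}
```

With filterSize == 8, four pixels' worth of coefficients span 4 * 8 * 2 = 64 contiguous bytes, so one load can fetch them all. With any larger multiple of 8, each pixel's first 8 coefficients are separated by the rest of its row, so the loads have to stay separate.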

- The final summations were improved.
- The inner loop on filterSize was completely removed
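
As a rough scalar model of the last bullet (not the NEON code from the patch), the difference between the generic path and the filterSize == 8 specialization can be sketched in C as below. The function names are hypothetical, and the >> 7 shift and the clamp to 15 bits are assumptions about the 8-bit to 15-bit case rather than something stated in this thread.

```c
#include <stdint.h>

/* Clamp the scaled sum to the 15-bit output range (assumed behaviour). */
static int16_t clip15_model(int val)
{
    val >>= 7;
    return (int16_t)(val > 32767 ? 32767 : val);
}

/* Generic model: an inner loop runs over filterSize taps per output pixel. */
static void hscale_generic_model(int16_t *dst, int dstW, const uint8_t *src,
                                 const int16_t *filter,
                                 const int32_t *filterPos, int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        int val = 0;
        for (int j = 0; j < filterSize; j++)
            val += (int)src[filterPos[i] + j] * filter[filterSize * i + j];
        dst[i] = clip15_model(val);
    }
}

/* Specialized model: filterSize is known to be 8, so the inner loop is
 * replaced by a fixed sum of eight products. */
static void hscale_8tap_model(int16_t *dst, int dstW, const uint8_t *src,
                              const int16_t *filter, const int32_t *filterPos)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + 8 * i;
        int val = s[0] * f[0] + s[1] * f[1] + s[2] * f[2] + s[3] * f[3]
                + s[4] * f[4] + s[5] * f[5] + s[6] * f[6] + s[7] * f[7];
        dst[i] = clip15_model(val);
    }
}
```

Both models produce the same output when filterSize is 8; the specialized one simply has no loop counter or branch per output pixel, which is the overhead the separate function avoids.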

I presume that this is the only differing factor which affects whether it's worthwhile to keep a separate width=8 function or not. At least from the checkasm benchmark numbers, the difference is notable but not huge (in the range of 4-10%, while the summation improvements gain even more).

Given a fully optimized function that has an inner loop (which is only taken once for the width=8 case), is the separate function without an inner loop really necessary?

With the ideal version of the final summation in both functions, the separate filterSize=8 function is 11-19% faster than the generic multiple-of-8 function (on Cortex A53 and A72; on A73 both versions are essentially equally fast), so there's probably good reason to go with the separate version.

Thus, disregard the review comments above.

// Martin