On Sun, 17 Apr 2022, Martin Storsjö wrote:
On Fri, 15 Apr 2022, Swinney, Jonathan wrote:
This patch adds specializations for hscale for filterSize == 4 and 8 and
converts the existing implementation for the X8 version. For the old code, now
used for the X8 version, it improves the efficiency of the final summations by
reducing 11 instructions to 7.
ff_hscale8to15_8_neon is mostly unchanged from the original except for a few
changes.
- The loads for the filter data were consolidated into a single 64 byte ld1
instruction.
Couldn't you do this optimization on the existing function too?
Sorry, I now realize that this optimization can only be done if you
operate on a specific known filter width.
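To illustrate why the width has to be known, here is a minimal C sketch of the filter addressing. It assumes the row-major layout filter[filterSize * i + j] with int16_t coefficients and a loop that handles four output pixels per iteration; the helper names are hypothetical and are not taken from the patch. With filterSize == 8, the four rows occupy one contiguous 4 * 8 * 2 = 64 byte block, which is what a single wide load needs. With a larger width, the 8-coefficient chunks of successive rows are separated by filterSize * 2 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of the first coefficient for output pixel i, assuming the
 * row-major layout filter[filterSize * i + j] with int16_t coefficients
 * (hypothetical helper, for illustration only). */
static size_t filter_row_offset(int i, int filterSize)
{
    return (size_t)i * (size_t)filterSize * sizeof(int16_t);
}

/* Returns 1 if the first 8 coefficients of the four rows i..i+3 sit
 * back to back in memory (16 bytes apart), i.e. form one 64-byte block
 * that a single wide load could fetch. Returns 0 otherwise. */
static int rows_are_contiguous_64(int i, int filterSize)
{
    size_t base = filter_row_offset(i, filterSize);

    for (int k = 0; k < 4; k++) {
        size_t expected = base + (size_t)k * 8 * sizeof(int16_t);
        if (filter_row_offset(i + k, filterSize) != expected)
            return 0;
    }
    return 1;
}
```

Under these assumptions, rows_are_contiguous_64() holds for filterSize == 8 and fails for 16 or 24, which matches the observation that the generic multiple-of-8 function cannot use the same load.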
- The final summations were improved.
- The inner loop on filterSize was completely removed
I presume that this is the only differing factor which affects whether it's
worthwhile to keep a separate width=8 function or not. At least from the
checkasm benchmark numbers, the difference is notable but not huge (in the
range of 4-10%, while the summation improvements gain even more).
Given a fully optimized function that has an inner loop (which is only taken
once for the width=8 case), is the separate function without an inner loop
really necessary?
With the ideal version of the final summation in both functions, the
separate filtersize=8 function is 11-19% faster than the generic
multiple-of-8 function (on Cortex A53 and A72; on A73 both versions
are essentially equally fast), so there's probably good reason to go with
the separate version.
Thus, disregard the review comments above.
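
For readers who want the two variants side by side, here is a rough scalar C sketch of what is being compared. It is not the NEON code from the patch: the function names hscale_x8 and hscale_8 and the helper clip15 are hypothetical, and the arithmetic (sum of src * filter, shift right by 7, clip to 15 bits) is assumed to follow the scalar 8-to-15-bit reference. The generic version keeps an inner loop over filterSize in steps of 8; the specialized version drops that loop because exactly one block of 8 taps is processed per output pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Final step shared by both variants: scale down by 7 bits and clip the
 * result to the 15-bit maximum (assumed to match the scalar reference). */
static int16_t clip15(int val)
{
    val >>= 7;
    return (int16_t)(val < 32767 ? val : 32767);
}

/* Generic variant: filterSize must be a multiple of 8. The loop over j
 * is the inner loop that runs only once when filterSize == 8. */
static void hscale_x8(int16_t *dst, int dstW, const uint8_t *src,
                      const int16_t *filter, const int32_t *filterPos,
                      int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + (size_t)filterSize * i;
        int val = 0;

        for (int j = 0; j < filterSize; j += 8)
            for (int k = 0; k < 8; k++)
                val += s[j + k] * f[j + k];
        dst[i] = clip15(val);
    }
}

/* Specialized variant for filterSize == 8: no loop over filterSize,
 * and the filter row stride is the compile-time constant 8. */
static void hscale_8(int16_t *dst, int dstW, const uint8_t *src,
                     const int16_t *filter, const int32_t *filterPos)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + (size_t)8 * i;
        int val = 0;

        for (int k = 0; k < 8; k++)
            val += s[k] * f[k];
        dst[i] = clip15(val);
    }
}
```

Both functions produce the same output for filterSize == 8; the difference is only the loop overhead and the fixed row stride, which is the part the benchmark numbers above are measuring in the NEON versions.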
// Martin