Hi,

On Tue, May 14, 2024 at 4:40 PM Stone Chen <chen.stonec...@gmail.com> wrote:
> + vvc_sad_8:
> + .loop_height:
> +     movu xm0, [src1q]
> +     movu xm1, [src2q]
> +     MIN_MAX_SAD xm2, xm0, xm1
> +     vpmovzxwd m1, xm1
> +     vpaddd m3, m1
> [..]
> + vvc_sad_16_128:
> + .loop_height:
> [..]
> + .loop_width:
> +     movu xm0, [src1q]
> +     movu xm1, [src2q]
> +     MIN_MAX_SAD xm2, xm0, xm1
> +     vpmovzxwd m1, xm1
> +     vpaddd m3, m1

Wouldn't it be more efficient if the main loops did a full register worth at a time?

    vpbroadcastd m4, [pw_1]
loop:
    movu m0, [src1q]
    movu m1, [src2q]
    MIN_MAX_SAD m2, m0, m1
    pmaddwd m1, m4
    paddd m3, m1

(And then for w8, load 2 rows per iteration using movu xmN, [row0] and vinserti128 mN, [row1], 1.)

Ronald
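
To make the comparison concrete, here is a rough scalar C model of the two accumulation strategies. It is only a sketch, not the patch's code: the function names are made up for illustration, and it assumes that MIN_MAX_SAD leaves the per-lane absolute difference (max minus min) in its last operand, that every difference fits in a signed 16-bit value (pmaddwd multiplies signed words), that widths are multiples of 16, and that the w8 height is even.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 16-bit lane of the assumed MIN_MAX_SAD result: |a - b| as max - min. */
static uint16_t abs_diff_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a > b ? a - b : b - a);
}

/* Patch variant: 8 samples per step. Each difference is zero-extended to
 * 32 bits (vpmovzxwd) and added to 8 dword accumulators (vpaddd). */
static uint32_t sad_row_widen(const uint16_t *src1, const uint16_t *src2,
                              size_t width)
{
    uint32_t acc[8] = {0}, sum = 0;
    for (size_t x = 0; x < width; x += 8)
        for (size_t lane = 0; lane < 8; lane++)
            acc[lane] += abs_diff_u16(src1[x + lane], src2[x + lane]);
    for (size_t lane = 0; lane < 8; lane++)
        sum += acc[lane];
    return sum;
}

/* Suggested variant: 16 samples per step. pmaddwd against a vector of ones
 * multiplies each signed word by 1 and adds adjacent pairs into 8 dwords,
 * which paddd then accumulates. */
static uint32_t sad_row_madd(const uint16_t *src1, const uint16_t *src2,
                             size_t width)
{
    uint32_t acc[8] = {0}, sum = 0;
    for (size_t x = 0; x < width; x += 16)
        for (size_t lane = 0; lane < 8; lane++) {
            /* Assumes both differences are below 32768. */
            int16_t lo = (int16_t)abs_diff_u16(src1[x + 2 * lane],
                                               src2[x + 2 * lane]);
            int16_t hi = (int16_t)abs_diff_u16(src1[x + 2 * lane + 1],
                                               src2[x + 2 * lane + 1]);
            acc[lane] += (uint32_t)((int32_t)lo * 1 + (int32_t)hi * 1);
        }
    for (size_t lane = 0; lane < 8; lane++)
        sum += acc[lane];
    return sum;
}

/* w8 idea: fill a 16-sample register from two 8-sample rows
 * (movu xmN, [row0] then vinserti128 mN, [row1], 1) and reuse the
 * full-width step. stride is in samples; height is assumed even. */
static uint32_t sad_w8_two_rows(const uint16_t *src1, const uint16_t *src2,
                                size_t stride, size_t height)
{
    uint32_t sum = 0;
    for (size_t y = 0; y < height; y += 2) {
        uint16_t a[16], b[16];
        memcpy(a,     src1 + y * stride,       8 * sizeof(*a)); /* low half  */
        memcpy(a + 8, src1 + (y + 1) * stride, 8 * sizeof(*a)); /* high half */
        memcpy(b,     src2 + y * stride,       8 * sizeof(*b));
        memcpy(b + 8, src2 + (y + 1) * stride, 8 * sizeof(*b));
        sum += sad_row_madd(a, b, 16);
    }
    return sum;
}
```

Under those assumptions both variants return the same sum. The second one handles twice as many samples per load and replaces the widening step with a multiply-add.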