On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnov <an...@khirnov.net> wrote:
> +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max
> +%if ARCH_X86_32
> +    VBROADCASTSS m0, minm
> +    VBROADCASTSS m1, maxm
> +%else
> +    VBROADCASTSS m0, m0
> +    VBROADCASTSS m1, m1
> +%endif
This will fail on WIN64; to deal with the somewhat silly calling conventions
on that platform you need to do something like

    VBROADCASTSS m0, m3
    VBROADCASTSS m1, maxm

(not tested, I don't have access to a Windows machine at the moment).

> +    movsxdifnidn lenq, lend
> +    shl lenq, 2
> +
> +.loop
> +    sub lenq, 4 * mmsize

Move the subtraction to just before the branch (jg) to allow macro-op fusion
on modern Intel CPUs.

> +
> +    mova m2, [srcq + lenq + 0 * mmsize]
> +    mova m3, [srcq + lenq + 1 * mmsize]
> +    mova m4, [srcq + lenq + 2 * mmsize]
> +    mova m5, [srcq + lenq + 3 * mmsize]
> +
> +    maxps m2, m0
> +    maxps m3, m0
> +    maxps m4, m0
> +    maxps m5, m0

Use the three-operand form of maxps to fold the loads, instead of separate
mova instructions.

> +    minps m2, m1
> +    minps m3, m1
> +    minps m4, m1
> +    minps m5, m1
> +
> +    mova [dstq + lenq + 0 * mmsize], m2
> +    mova [dstq + lenq + 1 * mmsize], m3
> +    mova [dstq + lenq + 2 * mmsize], m4
> +    mova [dstq + lenq + 3 * mmsize], m5
> +
> +    jg .loop
> +
> +    RET

Otherwise LGTM. You could also make an AVX version using ymm registers in a
separate patch if you want to; you would just need to make sure the buffers
are aligned.

_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
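P.S. for anyone following the review without reading x86inc: the function
under discussion clamps each float to [min, max], applying maxps (lower
bound) before minps (upper bound). A scalar C sketch of that behavior is
below; the name and prototype here are assumptions for illustration, not
taken from libav.

```c
#include <stddef.h>

/* Scalar reference for the SIMD loop in the patch: clamp each element of
 * src to [min, max] and write it to dst.  Hypothetical sketch; the real
 * libav prototype may use int len and differ in signedness. */
static void vector_clipf_c(float *dst, const float *src, size_t len,
                           float min, float max)
{
    for (size_t i = 0; i < len; i++) {
        float v = src[i];
        v = v < min ? min : v; /* corresponds to maxps with the min vector */
        v = v > max ? max : v; /* corresponds to minps with the max vector */
        dst[i] = v;
    }
}
```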