On Mon, Apr 30, 2018 at 6:17 PM, Paul B Mahol <one...@gmail.com> wrote:
> + .loop0:
> +     movu   m1, [dq + xq]
> +     movu   m2, [aq + xq]
> +     movu   m3, [sq + xq]
> +
> +     pshufb m1, [pb_b2dw]
> +     pshufb m2, [pb_b2dw]
> +     pshufb m3, [pb_b2dw]
> +     mova   m4, [pd_255]
> +     psubd  m4, m2
> +     pmulld m1, m4
> +     pmulld m3, m2
> +     paddd  m1, m3
> +     paddd  m1, [pd_128]
> +     pmulld m1, [pd_257]
> +     psrad  m1, 16
> +     pshufb m1, [pb_dw2b]
> +     movd   [dq+xq], m1
> +     add    xq, mmsize / 4
Unpacking to dwords seems inefficient when you could do something like
this (untested):

    mova     m3, [pw_255]
    mova     m4, [pw_128]
    mova     m5, [pw_257]
.loop0:
    pmovzxbw m0, [sq + xq]
    pmovzxbw m2, [aq + xq]
    pmovzxbw m1, [dq + xq]
    pmullw   m0, m2
    pxor     m2, m3
    pmullw   m1, m2
    paddw    m0, m4
    paddw    m0, m1
    pmulhuw  m0, m5
    packuswb m0, m0
    movq     [dq+xq], m0
    add      xq, mmsize / 2

which does twice as much work per iteration. Also note that pmulld is
slow on most CPUs.

> + .loop1:
> +     xor  tq, tq
> +     xor  uq, uq
> +     xor  vq, vq
> +     mov  rd, 255
> +     mov  tb, [aq + xq]
> +     neg  tb
> +     add  rb, tb
> +     mov  ub, [sq + xq]
> +     neg  tb
> +     imul ud, td
> +     mov  vb, [dq + xq]
> +     imul rd, vd
> +     add  rd, ud
> +     add  rd, 128
> +     imul rd, 257
> +     sar  rd, 16
> +     mov  [dq + xq], rb
> +     add  xq, 1
> +     cmp  xq, wq
> +     jl   .loop1

Is doing the tail in scalar necessary? E.g. can you pad the buffers so
that reading/writing past the end is OK, and just run the SIMD loop? If
that's impossible, it'd probably be better to do a separate SIMD loop
that uses pinsr/pextr to load/store pixels depending on the number of
elements left.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel