On 4/30/18, Henrik Gramner <hen...@gramner.com> wrote:
> On Mon, Apr 30, 2018 at 6:17 PM, Paul B Mahol <one...@gmail.com> wrote:
>> +    .loop0:
>> +        movu     m1, [dq + xq]
>> +        movu     m2, [aq + xq]
>> +        movu     m3, [sq + xq]
>> +
>> +        pshufb   m1, [pb_b2dw]
>> +        pshufb   m2, [pb_b2dw]
>> +        pshufb   m3, [pb_b2dw]
>> +        mova     m4, [pd_255]
>> +        psubd    m4, m2
>> +        pmulld   m1, m4
>> +        pmulld   m3, m2
>> +        paddd    m1, m3
>> +        paddd    m1, [pd_128]
>> +        pmulld   m1, [pd_257]
>> +        psrad    m1, 16
>> +        pshufb   m1, [pb_dw2b]
>> +        movd     [dq+xq], m1
>> +        add      xq, mmsize / 4
>
> Unpacking to dwords seems inefficient when you could do something like
> this (untested):
>
>     mova        m3, [pw_255]
>     mova        m4, [pw_128]
>     mova        m5, [pw_257]
> .loop0:
>     pmovzxbw    m0, [sq + xq]
>     pmovzxbw    m2, [aq + xq]
>     pmovzxbw    m1, [dq + xq]
>     pmullw      m0, m2
>     pxor        m2, m3
>     pmullw      m1, m2
>     paddw       m0, m4
>     paddw       m0, m1
>     pmulhuw     m0, m5
>     packuswb    m0, m0
>     movq        [dq+xq], m0
>     add         xq, mmsize / 2

Will experiment with this.

>
> which does twice as much per iteration. Also note that pmulld is slow
> on most CPUs.

This SIMD is not for CPUs found in museums.

>
>> +    .loop1:
>> +        xor      tq, tq
>> +        xor      uq, uq
>> +        xor      vq, vq
>> +        mov      rd, 255
>> +        mov      tb, [aq + xq]
>> +        neg      tb
>> +        add      rb, tb
>> +        mov      ub, [sq + xq]
>> +        neg      tb
>> +        imul     ud, td
>> +        mov      vb, [dq + xq]
>> +        imul     rd, vd
>> +        add      rd, ud
>> +        add      rd, 128
>> +        imul     rd, 257
>> +        sar      rd, 16
>> +        mov      [dq + xq], rb
>> +        add      xq, 1
>> +        cmp      xq, wq
>> +        jl .loop1
>
> Is doing the tail in scalar necessary? E.g. can you pad the buffers so
> that reading/writing past the end is OK and just run the SIMD loop?

Overlay does not operate that way, you can overlay 1 pixel onto hd720 frame.
Do you get it now?

>
> If that's impossible it'd probably be better to do a separate SIMD
> loop and pinsr/pextr input/output pixels depending on the number of
> elements left.

That seems too complicated.
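
For anyone who wants to sanity-check the word-based suggestion before
touching the asm, here is a rough scalar C sketch. It is not part of the
patch. The function names (blend_dword, blend_word, count_mismatches) are
made up for illustration, and each function only models the per-pixel
arithmetic of one loop, not the loads, shuffles or stores. The idea is that
s*a + d*(255-a) + 128 is at most 65153, so it should fit in an unsigned
16-bit lane, and a ^ 255 equals 255 - a for byte values.

```c
#include <stdint.h>

/* Models the dword arithmetic of the posted .loop0 for one pixel. */
static uint8_t blend_dword(uint8_t s, uint8_t d, uint8_t a)
{
    int32_t v = d * (255 - a) + s * a;  /* psubd, pmulld, pmulld, paddd */
    v = (v + 128) * 257;                /* paddd pd_128, pmulld pd_257 */
    return (uint8_t)(v >> 16);          /* psrad 16, then narrow to byte */
}

/* Models the suggested word arithmetic for one pixel, keeping every
 * intermediate in 16 bits the way a word lane would. */
static uint8_t blend_word(uint8_t s, uint8_t d, uint8_t a)
{
    uint16_t m0 = (uint16_t)(s * a);    /* pmullw m0, m2 */
    uint16_t m2 = (uint16_t)(a ^ 255);  /* pxor m2, pw_255 -> 255 - a */
    uint16_t m1 = (uint16_t)(d * m2);   /* pmullw m1, m2 */
    m0 = (uint16_t)(m0 + 128);          /* paddw m0, pw_128 */
    m0 = (uint16_t)(m0 + m1);           /* paddw m0, m1 */
    /* pmulhuw m0, pw_257: high 16 bits of the unsigned 16x16 product */
    return (uint8_t)(((uint32_t)m0 * 257u) >> 16);
}

/* Exhaustively compares both models over all (s, d, a) byte triples
 * and returns how many triples disagree. */
static long count_mismatches(void)
{
    long bad = 0;
    for (int s = 0; s < 256; s++)
        for (int d = 0; d < 256; d++)
            for (int a = 0; a < 256; a++)
                if (blend_dword((uint8_t)s, (uint8_t)d, (uint8_t)a) !=
                    blend_word((uint8_t)s, (uint8_t)d, (uint8_t)a))
                    bad++;
    return bad;
}
```

Calling count_mismatches() from a small test program walks all 2^24
combinations. If it returns 0, the two formulas agree per pixel and the
difference between the loops is only throughput. This says nothing about
the tail handling, which is a separate question.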