On 4/30/2018 3:57 PM, Paul B Mahol wrote:
> On 4/30/18, Henrik Gramner <hen...@gramner.com> wrote:
>> On Mon, Apr 30, 2018 at 6:17 PM, Paul B Mahol <one...@gmail.com> wrote:
>>> + .loop0:
>>> +     movu m1, [dq + xq]
>>> +     movu m2, [aq + xq]
>>> +     movu m3, [sq + xq]
>>> +
>>> +     pshufb m1, [pb_b2dw]
>>> +     pshufb m2, [pb_b2dw]
>>> +     pshufb m3, [pb_b2dw]
>>> +     mova m4, [pd_255]
>>> +     psubd m4, m2
>>> +     pmulld m1, m4
>>> +     pmulld m3, m2
>>> +     paddd m1, m3
>>> +     paddd m1, [pd_128]
>>> +     pmulld m1, [pd_257]
>>> +     psrad m1, 16
>>> +     pshufb m1, [pb_dw2b]
>>> +     movd [dq+xq], m1
>>> +     add xq, mmsize / 4
>>
>> Unpacking to dwords seems inefficient when you could do something like
>> this (untested):
>>
>> mova m3, [pw_255]
>> mova m4, [pw_128]
>> mova m5, [pw_257]
>> .loop0:
>>     pmovzxbw m0, [sq + xq]
>>     pmovzxbw m2, [aq + xq]
>>     pmovzxbw m1, [dq + xq]
>>     pmullw m0, m2
>>     pxor m2, m3
>>     pmullw m1, m2
>>     paddw m0, m4
>>     paddw m0, m1
>>     pmulhuw m0, m5
>>     packuswb m0, m0
>>     movq [dq+xq], m0
>>     add xq, mmsize / 2
>
> Will experiment with this.
>
>>
>> which does twice as much per iteration. Also note that pmulld is slow
>> on most CPUs.
>
> This SIMD is not for CPUs found in museums.

pmulld is SSE4.1, and no museum CPU supports it. It's the slowest
multiplication instruction by far on every CPU that supports it (in some
cases twice as slow as pmullw, pmuldq, etc.), and if it can be avoided, it
absolutely should be.

>
>>
>>> + .loop1:
>>> +     xor tq, tq
>>> +     xor uq, uq
>>> +     xor vq, vq
>>> +     mov rd, 255
>>> +     mov tb, [aq + xq]
>>> +     neg tb
>>> +     add rb, tb
>>> +     mov ub, [sq + xq]
>>> +     neg tb
>>> +     imul ud, td
>>> +     mov vb, [dq + xq]
>>> +     imul rd, vd
>>> +     add rd, ud
>>> +     add rd, 128
>>> +     imul rd, 257
>>> +     sar rd, 16
>>> +     mov [dq + xq], rb
>>> +     add xq, 1
>>> +     cmp xq, wq
>>> +     jl .loop1
>>
>> Is doing the tail in scalar necessary? E.g. can you pad the buffers so
>> that reading/writing past the end is OK and just run the SIMD loop?
>
> Overlay does not operate that way, you can overlay 1 pixel onto hd720 frame.
> Do you get it now?
>
>>
>> If that's impossible it'd probably be better to do a separate SIMD
>> loop and pinsr/pextr input/output pixels depending on the number of
>> elements left.
>
> That seems too complicated.
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
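
As a sanity check on the word-based loop quoted above: the intermediate
d*(255 - a) + s*a + 128 never exceeds 255*255 + 128 = 65153, so it fits in an
unsigned 16-bit lane and pmulhuw by 257 should give the same byte as the
pmulld/psrad path. A minimal scalar sketch in C that models both variants and
compares them over every input is below. The function names (blend_ref,
blend_w16, count_mismatches) are made up for illustration and are not from the
patch; this models the arithmetic only, not the actual SIMD code.

```c
#include <stdint.h>

/* Reference: 32-bit intermediates, as in the dword (pmulld) version. */
static uint8_t blend_ref(uint8_t d, uint8_t s, uint8_t a)
{
    uint32_t x = (uint32_t)d * (255u - a) + (uint32_t)s * a + 128u;
    return (uint8_t)((x * 257u) >> 16);            /* pmulld by 257, psrad 16 */
}

/* Model of the word version: every intermediate is truncated to 16 bits. */
static uint8_t blend_w16(uint8_t d, uint8_t s, uint8_t a)
{
    uint16_t sa = (uint16_t)(s * a);               /* pmullw: low 16 bits */
    uint16_t ia = (uint16_t)(a ^ 255);             /* pxor with 255 == 255 - a */
    uint16_t da = (uint16_t)(d * ia);              /* pmullw: low 16 bits */
    uint16_t x  = (uint16_t)(sa + 128);            /* paddw */
    x = (uint16_t)(x + da);                        /* paddw */
    return (uint8_t)(((uint32_t)x * 257u) >> 16);  /* pmulhuw: high 16 bits */
}

/* Exhaustively compare both models; returns the number of differing inputs. */
static unsigned long count_mismatches(void)
{
    unsigned long bad = 0;
    for (int d = 0; d < 256; d++)
        for (int s = 0; s < 256; s++)
            for (int a = 0; a < 256; a++)
                if (blend_ref((uint8_t)d, (uint8_t)s, (uint8_t)a) !=
                    blend_w16((uint8_t)d, (uint8_t)s, (uint8_t)a))
                    bad++;
    return bad;
}
```

Calling count_mismatches() from a small test program and checking that it
returns 0 is enough to confirm that the two formulations agree for all 8-bit
inputs. Since the result is at most 255, the final packuswb never has to
saturate.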