On 11/07/2011 06:26 PM, Loren Merritt wrote:
> On Mon, 7 Nov 2011, Justin Ruggles wrote:
>
>> +.loop:
>> +    movu    m1, [v1q+offsetq]
>> +    mulps   m1, m1, [v2q+offsetq]
>> +    addps   m0, m0, m1
>> +    add     offsetq, mmsize
>>      js      .loop
>
> addps has latency 3 or 4, whereas the loop should be 1 or 2 cycles per
> iteration just counting uops. Thus it's latency bound and could be
> improved by multiple accumulators.

i was wondering about that. i'll try it out.
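maybe something like this? (untested sketch; assumes m1 is zeroed alongside
m0 before the loop and that the length is a multiple of 2*mmsize)

.loop:
    movu    m2, [v1q+offsetq]
    movu    m3, [v1q+offsetq+mmsize]
    mulps   m2, m2, [v2q+offsetq]
    mulps   m3, m3, [v2q+offsetq+mmsize]
    addps   m0, m0, m2          ; accumulator 0
    addps   m1, m1, m3          ; accumulator 1, breaks the addps dep chain
    add     offsetq, 2*mmsize
    js      .loop
    addps   m0, m0, m1          ; merge the two accumulators after the loop
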
>> +%if cpuflag(avx)
>> +    vextractf128 xmm0, ymm0, 0
>
> Does this work? Docs say that (like any VEX op) vextractf128 to xmm
> clobbers the upper half of the corresponding ymm.
> And it's unnecessary, xmm0 is already the lower half of ymm0.

ok, so i'll just extract the upper half to xmm1.

>> +    vextractf128 xmm1, ymm0, 1
>> +    addps        xmm0, xmm1
>> +%endif
>> +%if cpuflag(sse3)
>> +    haddps       xmm0, xmm0
>> +    haddps       xmm0, xmm0
>
> Is this really an improvement? How about pshuflw?

you mean pshuflw instead of the shufps below?

>> +%else
>>      movhlps xmm1, xmm0
>>      addps   xmm0, xmm1
>>      movss   xmm1, xmm0
>>      shufps  xmm0, xmm0, 1
>>      addss   xmm0, xmm1
>> +%endif
>>  %ifndef ARCH_X86_64
>>      movd    r0m, xmm0
>>      fld     dword r0m
>>  %endif
>>      RET
>> +%endmacro
>
> Does this need a vzeroupper?

i suppose it's actually needed after the vextractf128 in order to avoid a
transition penalty when using the low half of ymm0 as xmm0 in the next
addps. from that point on it's all xmm instructions.
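so the tail would be something like this, i think (untested sketch; keeps
the movhlps/shufps reduction and just puts the vzeroupper right after the
extract):

%if cpuflag(avx)
    vextractf128 xmm1, ymm0, 1
    vzeroupper              ; upper ymm halves are dead from here on
    addps        xmm0, xmm1
%endif
    movhlps      xmm1, xmm0
    addps        xmm0, xmm1
    movss        xmm1, xmm0
    shufps       xmm0, xmm0, 1
    addss        xmm0, xmm1

-Justin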
