On 11/07/2011 06:26 PM, Loren Merritt wrote:

> On Mon, 7 Nov 2011, Justin Ruggles wrote:
> 
>> +.loop:
>> +    movu            m1, [v1q+offsetq]
>> +    mulps           m1, m1, [v2q+offsetq]
>> +    addps           m0, m0, m1
>> +    add        offsetq, mmsize
>>      js           .loop
> 
> addps has a latency of 3 or 4 cycles, whereas the loop should take 1 or
> 2 cycles per iteration just counting uops. Thus it's latency-bound and
> could be improved by using multiple accumulators.

I was wondering about that. I'll try it out.
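
Maybe something along these lines (untested sketch, not from the patch;
assumes the buffer length is a multiple of 2*mmsize and that m2/m3 are
free to use):

    xorps           m0, m0, m0
    xorps           m2, m2, m2        ; second accumulator
.loop:
    movu            m1, [v1q+offsetq]
    movu            m3, [v1q+offsetq+mmsize]
    mulps           m1, m1, [v2q+offsetq]
    mulps           m3, m3, [v2q+offsetq+mmsize]
    addps           m0, m0, m1        ; accumulator 1
    addps           m2, m2, m3        ; accumulator 2
    add        offsetq, 2*mmsize
    js .loop
    addps           m0, m0, m2        ; combine the accumulators

That way the two addps chains can overlap, so the loop-carried latency
is spread over twice as many elements per iteration.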

>> +%if cpuflag(avx)
>> +    vextractf128  xmm0, ymm0, 0
> 
> Does this work? Docs say that (like any VEX op) vextractf128 to xmm
> clobbers the upper half of the corresponding ymm.
> And it's unnecessary, xmm0 is already the lower half of ymm0.

OK, so I'll just extract the upper half to xmm1:

>> +    vextractf128  xmm1, ymm0, 1
>> +    addps         xmm0, xmm1
>> +%endif
>> +%if cpuflag(sse3)
>> +    haddps        xmm0, xmm0
>> +    haddps        xmm0, xmm0
> 
> Is this really an improvement? How about pshuflw?

You mean pshuflw instead of the shufps below? Something like this,
maybe (untested guess at what you meant; the 0xe immediate moves float
element 1 down into element 0):
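
    movhlps xmm1, xmm0
    addps   xmm0, xmm1
    pshuflw xmm1, xmm0, 0xe  ; element 1 -> element 0, drops the movss
    addss   xmm0, xmm1

pshuflw is SSE2 and runs in the integer domain, so there might be a
bypass delay, but it does save the movss.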

>> +%else
>>      movhlps xmm1, xmm0
>>      addps   xmm0, xmm1
>>      movss   xmm1, xmm0
>>      shufps  xmm0, xmm0, 1
>>      addss   xmm0, xmm1
>> +%endif
>>  %ifndef ARCH_X86_64
>>      movd    r0m,  xmm0
>>      fld     dword r0m
>>  %endif
>>      RET
>> +%endmacro
> 
> Does this need a vzeroupper?


I suppose it's actually needed after the vextractf128, in order to
avoid a transition penalty when using the low half of ymm0 as xmm0 in
the next addps. From that point on it's all xmm instructions.

-Justin

