On Wed, Jun 8, 2016 at 11:34 PM, James Almer <jamr...@gmail.com> wrote: > On 6/7/2016 6:18 AM, Muhammad Faiz wrote: >>>> + sub lend, 2 >>>> >> + lea dstq, [dstq + 16] >>> > >>> > Use add >>> > >>>> >> + lea coeffsq, [coeffsq + 2*Coeffs.sizeof] >>> > >>> > Same, assuming sizeof is an immediate. >>> > >> This is optimization to separate sub and jnz with lea. > > Does it make a noticeable difference? We don't really bother > keeping the instruction that sets the flag and the one that > checks it separated anywhere else, so i wonder. > I don't know.
>> Using add will clobber flag register. > > You could keep the coeffsq lea but change the dstq into an add > and move it above the sub instruction. No need for two slow leas > for the above purpose. > OK, probably add is faster. But AMD64 optimization guide said that lea instr have a latency 1 when there are two source operand (so it is as fast as add) Anyway, this code is not the critical path, so measuring faster one is difficult. >> Also lea does not need rex prefix > > add dstq, 16 seems to be four bytes, same as lea dstq [dstq+16], > at least on win64. I'm sorry, I'm wrong. It uses rex prefix > >>>> + SELECT_CQT_CALC(sse, SSE, 4, 0); >>>> >> + SELECT_CQT_CALC(sse3, SSE3, 4, 0); >>>> >> + SELECT_CQT_CALC(avx, AVX, 8, 01452367); >>> > >>> > Use AVX_FAST, so this function will not be used on CPUs that set the >>> > AV_CPU_FLAG_AVXSLOW flag. >>> > >>>> >> + SELECT_CQT_CALC(fma3, FMA3, 8, 01452367); >>> > >>> > Same, use FMA3_FAST. The result will then be the FMA3 version used by >>> > Intel CPUs and hopefully AMD Zen, and the FMA4 one by Bulldozer based >>> > CPUs. >>> > >>>> >> + SELECT_CQT_CALC(fma4, FMA4, 8, 01452367); >>>> >> +} >>>> >> >>> > >> OK, also reorder (FMA4 before AVX because AVX/ymm without FMA4 is faster than >> FMA4/xmm) > > Not in any machine supporting FMA4. And they will not use the > AVX or FMA3 functions here because they set the avxslow flag, > which the EXTERNAL_*_FAST macros check for. > > You could have kept the FMA4 init at the end and it would be > functionally the same, but in any case it's purely cosmetic. > If there is machine that support AVX (and fast ymm), FMA3 and FMA4, without reorder the selected code is FMA4. Anyway, I have applied the patch. Thank's for the review _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel