On Thu, 5 Jan 2012, Vitor Sessak wrote:

>+; input  %1={x1,x2,x3,x4}, %2={y1,y2,y3,y4}
>+; output %3={x4,y1,y2,y3}
>+%macro ROTLEFT_SSE 3
>+    BUILDINVHIGHLOW %1, %2, %3
>+    shufps %3, %3, %2, 0x99
>+%endmacro

(and other such macros)
If some macro args can be described as output and some as input, then
output should come first, to match the order of instruction arguments.

>+%macro PSHUFD_SSE_AVX 3
>+    shufps %1, %2, %2, %3
>+%endmacro

>+%macro PSHUFD_SSE2 3
>+    pshufd %1, %2, %3
>+%endmacro

The recommended way to write such things has changed since you
previously posted this patch:

%macro PSHUFD 3
%if cpuflag(sse2) && notcpuflag(avx)
    pshufd %1, %2, %3
%else
    shufps %1, %2, %2, %3
%endif
%endmacro

This eliminates the toplevel defines that used to be needed to select an
implementation.

>+%macro SPILL 2 ; xmm#, mempos
>+    movaps [tmpq+(%2-8)*16 + 32*4], m%1
>+%endmacro

>+%macro UNSPILL 2
>+    movaps m%1, [tmpq+(%2-8)*16 + 32*4]
>+%endmacro

>+%define SPILLED(x) [tmpq+(x-8)*16 + 32*4]

Use SPILLED in defining SPILL.

>+%define mova movaps
>+%define movu movups

cglobal undoes this. But it becomes unnecessary with cpuflags if you
only have an sse1 version.

> AVX_INSTR movsd, 1, 0, 0
> AVX_INSTR movss, 1, 0, 0
> AVX_INSTR mpsadbw, 0, 1, 0
>+AVX_INSTR movhlps, 1, 0, 0
>+AVX_INSTR movlhps, 1, 0, 0
> AVX_INSTR mulpd, 1, 0, 1
> AVX_INSTR mulps, 1, 0, 1
> AVX_INSTR mulsd, 1, 0, 1

Alphabetize.

> int align_end = count - (count & 3);

How much faster is ff_four_imdct36_float_sse? If you have 3 trailing
blocks, should you round up? Caveat: make sure any unused space that is
processed by SIMD float arithmetic contains valid floats, because NaNs
are slow.

--Loren Merritt
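P.S. To make the SPILLED suggestion concrete, one way to write it is to
define SPILLED first and have both macros expand through it, so the
address arithmetic lives in a single place (untested sketch, identifier
names taken from the patch):

```nasm
%define SPILLED(x) [tmpq+((x)-8)*16 + 32*4]

%macro SPILL 2 ; xmm#, mempos
    movaps SPILLED(%2), m%1
%endmacro

%macro UNSPILL 2 ; xmm#, mempos
    movaps m%1, SPILLED(%2)
%endmacro
```

Then a later change to the spill slot layout only has to touch the
SPILLED define.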