Hi,

On Sat, Jul 14, 2012 at 9:29 PM, Justin Ruggles <justin.rugg...@gmail.com> wrote:
> ---
>  libavresample/x86/audio_convert.asm    |   62 ++++++++++++++++++++++++++++++++
>  libavresample/x86/audio_convert_init.c |    9 +++++
>  2 files changed, 71 insertions(+), 0 deletions(-)
>
> diff --git a/libavresample/x86/audio_convert.asm b/libavresample/x86/audio_convert.asm
> index 0ca562a..fdcea3a 100644
> --- a/libavresample/x86/audio_convert.asm
> +++ b/libavresample/x86/audio_convert.asm
> @@ -269,6 +269,68 @@ INIT_XMM avx
>  CONV_S16P_TO_S16_2CH
>  %endif
>
> +;------------------------------------------------------------------------------
> +; void ff_conv_s16p_to_s16_6ch(int16_t *dst, int16_t *const *src, int len,
> +;                              int channels);
> +;------------------------------------------------------------------------------
> +
> +%macro CONV_S16P_TO_S16_6CH 0
> +cglobal conv_s16p_to_s16_6ch, 2,8,6, dst, src, src1, src2, src3, src4, src5, len
> +%if ARCH_X86_64
> +    mov          lend, r2d
> +%else
> +    %define      lend  dword r2m
> +%endif
Eehw, just do:

%if ARCH_X86_64
cglobal ..., 3, 8, 6, dst, src, len, src1, src2, ..
%else
.. what you do up there ..
%endif

> +    movq    [dstq    ], m1
> +    movq    [dstq+  8], m0
> +    movq    [dstq+ 16], m2
> +    movhps  [dstq+ 24], m1
> +    movhps  [dstq+ 32], m0
> +    movhps  [dstq+ 40], m2
> +    add        srcq, mmsize/2
> +    add        dstq, mmsize*3
> +    sub        lend, mmsize/4
> +    jg .loop
> +    REP_RET
> +%endmacro

Here, too, I think you can use imul lenq, 6, then add that to dstq, neg it,
and index dstq as [dstq+lenq+0/8/16/..]. Then add lend, mmsize/4 instead of
sub, jl instead of jg, and you can remove the add dstq, mmsize*3 from the
inner loop.

Does unrolling this by another factor of 2 (and thus being able to use
aligned loads/stores) make a performance difference?

Ronald
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
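[Editor's note: the counter-negation idiom suggested in the review above is easier to see outside of asm. Below is a minimal C sketch of the same transformation; the function name, channel count, and buffers are hypothetical, not the actual libavresample code. The end pointers and the negated index `i` stand in for the `imul`/`neg`/`jl` scheme: the loop body keeps a single counter update instead of per-pointer adds plus a separate `sub`.]

```c
#include <stdint.h>

/* Hypothetical example: interleave 2 planar int16_t channels using the
 * negated-counter idiom.  Advance each pointer past its end, negate the
 * sample count, and let one index run from -len up to 0.  The loop body
 * then needs only the single "++i" (cf. "add lend, mmsize/4" + "jl")
 * instead of separate pointer increments and a down-counting "sub". */
static void interleave_2ch(int16_t *dst, const int16_t *l,
                           const int16_t *r, int len)
{
    dst += 2 * len;          /* one past the end of dst            */
    l   += len;              /* one past the end of each plane     */
    r   += len;
    int i = -len;            /* counts up toward zero              */
    do {
        dst[2 * i]     = l[i];
        dst[2 * i + 1] = r[i];
    } while (++i < 0);       /* "jl .loop" equivalent              */
}
```

With `l = {1,2,3}` and `r = {4,5,6}`, this writes `dst = {1,4,2,5,3,6}`, the same result as a conventional forward loop.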