On 12/01/2012 06:20 AM, Christophe Gisquet wrote:
> Corrected 2 vs 4-spaces tabs and base4 constants.
> 
> 
> 0007-SBR-DSP-x86-implement-SSE-qmf_deint_bfly.patch
> 
> 
> From cf3326b20caea42b8654b754abb49f6186743501 Mon Sep 17 00:00:00 2001
> From: Christophe Gisquet <christophe.gisq...@gmail.com>
> Date: Mon, 26 Nov 2012 23:12:03 +0100
> Subject: [PATCH 07/11] SBR DSP x86: implement SSE qmf_deint_bfly
> 
> From 713 to 209 cycles on Penryn.
> Having a loop counter is a 7 cycle gain.
> Unrolling is another 7 cycle gain.
> Working in reverse scan is another 6 cycles.
> ---
>  libavcodec/x86/sbrdsp.asm    |   31 +++++++++++++++++++++++++++++++
>  libavcodec/x86/sbrdsp_init.c |    2 ++
>  2 files changed, 33 insertions(+), 0 deletions(-)
> 
> diff --git a/libavcodec/x86/sbrdsp.asm b/libavcodec/x86/sbrdsp.asm
> index 2304983..9c3ea84 100644
> --- a/libavcodec/x86/sbrdsp.asm
> +++ b/libavcodec/x86/sbrdsp.asm
> @@ -286,3 +286,34 @@ cglobal sbr_qmf_deint_neg, 2,3,4,v,src,vrev
>      sub      srcq, 32
>      cmp        vq, vrevq
>      jl      .loop
> +
> +; sbr_qmf_deint_bfly(float *v, const float *src0, const float *src1)
> +cglobal sbr_qmf_deint_bfly, 3,5,8, v,src0,src1,vrev,c

This needs an INIT_XMM sse line before the cglobal.

> +    mov        cq, 64*4-2*mmsize
> +    lea     vrevq, [vq + 64*4]
> +.loop:
> +    mova       m0, [src0q+cq]
> +    mova       m1, [src1q]
> +    mova       m4, [src0q+cq+mmsize]
> +    mova       m5, [src1q+mmsize]
> +    mova       m2, m0
> +    mova       m3, m1
> +    shufps     m2, m2, q0123
> +    shufps     m3, m3, q0123
> +    mova       m6, m4
> +    mova       m7, m5
> +    shufps     m6, m6, q0123
> +    shufps     m7, m7, q0123

Each mova/shufps pair above can be merged by using the three-operand
form that x86inc emulates:

shufps m2, m0, m0, q0123
shufps m3, m1, m1, q0123
shufps m6, m4, m4, q0123
shufps m7, m5, m5, q0123

An AVX version might be worth testing with that as well.
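
For anyone following along who is unsure what the q0123 immediate does here,
this is a minimal scalar model of shufps in C. It assumes that x86inc's qABCD
constants encode their digits in base 4, so q0123 would be 0x1B. The names
shufps_model and Q0123 are made up for this illustration and are not part of
the patch or of x86inc.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of x86inc's q0123: base-4 digits 0,1,2,3 -> 0x1B. */
#define Q0123 0x1B

/* Hypothetical scalar model of SSE shufps: the two low result lanes are
 * selected from a, the two high lanes from b, each by a 2-bit field of imm. */
static void shufps_model(float dst[4], const float a[4], const float b[4],
                         uint8_t imm)
{
    float tmp[4]; /* temporary so that dst may alias a or b */

    assert(dst && a && b);
    tmp[0] = a[ imm       & 3];
    tmp[1] = a[(imm >> 2) & 3];
    tmp[2] = b[(imm >> 4) & 3];
    tmp[3] = b[(imm >> 6) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

With both sources set to the same register, as in the suggested lines, this
immediate reverses the four floats, which is what the reverse scan relies on.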

> +    addps      m5, m2
> +    subps      m0, m7
> +    addps      m1, m6
> +    subps      m4, m3
> +    mova  [vrevq], m1
> +    mova  [vrevq+mmsize], m5
> +    mova  [vq+cq], m0
> +    mova  [vq+cq+mmsize], m4
> +    add     src1q, 2*mmsize
> +    add     vrevq, 2*mmsize
> +    sub        cq, 2*mmsize
> +    jge     .loop
> +    REP_RET

Other than that, it looks OK.
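
As a cross-check of the addressing, here is a rough C sketch of how I read
the loop. It is inferred from the quoted asm only, not copied from the C
version in the tree, and it assumes v holds 128 floats while src0 and src1
hold 64 each. Both function names are hypothetical.

```c
#include <assert.h>

/* Plain scalar reading of the butterfly (an inference from the asm). */
static void qmf_deint_bfly_ref(float *v, const float *src0, const float *src1)
{
    assert(v && src0 && src1);
    for (int i = 0; i < 64; i++) {
        v[i]       = src0[i] - src1[63 - i];
        v[127 - i] = src0[i] + src1[63 - i];
    }
}

/* Block-wise walk mirroring the asm: c counts down over src0 and the low
 * half of v, while src1 and vrev advance upward. One iteration covers
 * 8 floats, i.e. 2*mmsize bytes with XMM registers. */
static void qmf_deint_bfly_blocks(float *v, const float *src0,
                                  const float *src1)
{
    float *vrev = v + 64;

    assert(v && src0 && src1);
    for (int c = 64 - 8; c >= 0; c -= 8) {
        for (int k = 0; k < 8; k++) {
            v[c + k] = src0[c + k] - src1[7 - k];     /* subps with reversed src1 */
            vrev[k]  = src1[k]     + src0[c + 7 - k]; /* addps with reversed src0 */
        }
        src1 += 8;
        vrev += 8;
    }
}
```

Filling src0 and src1 with distinct values and comparing the two outputs is a
quick way to confirm that the reverse scan writes every one of the 128 output
floats exactly once.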

-Justin
