About 1.8x speedup compared to AVX version for full IDCT. Other
sub-IDCT scenarios also see speedups. Full --bench output for
idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):
nop: 16.5
vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
Hi,
On Sat, Jul 16, 2016 at 5:55 AM, Henrik Gramner wrote:
> On Wed, Jul 13, 2016 at 6:37 PM, Ronald S. Bultje
> wrote:
> > +cglobal vp9_idct_idct_32x32_add, 4, 9, 16, 2048, dst, stride, block, eob
> [...]
> > +movd xm0, [blockq]
> > +
On Wed, Jul 13, 2016 at 6:37 PM, Ronald S. Bultje wrote:
> +cglobal vp9_idct_idct_32x32_add, 4, 9, 16, 2048, dst, stride, block, eob
[...]
> +movd xm0, [blockq]
> +mova m1, [pw_11585x2]
> +pmulhrsw m0, m1
> +pmulhrsw
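
For readers unfamiliar with the idiom in the quoted hunk, here is a minimal scalar sketch in C of what a single 16-bit lane of `pmulhrsw` computes when the multiplier is `pw_11585x2`. It assumes `pw_11585x2` holds 11585 * 2 = 23170 in each lane, as the name suggests. The function names `pmulhrsw_lane` and `dc_scale_reference` are hypothetical and exist only for this illustration; this is not code from the patch.

```c
#include <stdint.h>

/* Assumed value of each 16-bit lane of pw_11585x2 (11585 * 2). */
#define PW_11585X2 23170

/* Scalar model of one 16-bit lane of SSSE3 pmulhrsw:
 * the 32-bit product is rounded and shifted right by 15.
 * Relies on arithmetic right shift of negative values, which is
 * implementation-defined in C but what common compilers provide. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;
    return (int16_t)((prod + 0x4000) >> 15);
}

/* The same scaling written as a rounded multiply by 11585 / 2^14.
 * Doubling the constant lets pmulhrsw's fixed shift of 15 produce
 * the same result as a shift of 14 with the undoubled constant. */
static int16_t dc_scale_reference(int16_t x)
{
    return (int16_t)(((int32_t)x * 11585 + (1 << 13)) >> 14);
}
```

Under these assumptions, `pmulhrsw_lane(x, PW_11585X2)` and `dc_scale_reference(x)` agree for every 16-bit input, since (2k) >> 15 equals k >> 14 for an arithmetic shift.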
Hi,
On Wed, Jul 13, 2016 at 12:37 PM, Ronald S. Bultje
wrote:
> About 1.8x speedup compared to AVX version for full IDCT. Other
> sub-IDCT scenarios also see speedups. Full --bench output for
> idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):
>
> nop: 16.5
>