[FFmpeg-devel] [PATCH] vp9: add 32x32 idct AVX2 implementation.

2016-07-21 Thread Ronald S. Bultje
About 1.8x speedup compared to AVX version for full IDCT. Other sub-IDCT
scenarios also see speedups. Full --bench output for
idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):

nop: 16.5
vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
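
As a reading aid for the --bench figures, a speedup is simply the ratio of two reported values. The sketch below uses only the two numbers visible in this excerpt (the C and SSE2 results for the 8 bpp, sub-IDCT 1 case). The helper name `speedup` is hypothetical and not part of checkasm, and the AVX and AVX2 figures behind the "1.8x" claim are not shown here, so that claim is not reproduced.

```c
#include <stdio.h>

/* Hypothetical helper: ratio of a baseline timing to an optimized timing.
 * Both arguments are in the same unit as the --bench output. */
static double speedup(double baseline, double optimized)
{
    return baseline / optimized;
}

/* Prints the C-vs-SSE2 ratio for the two values quoted in the message. */
static void print_c_vs_sse2(void)
{
    const double c_time    = 2284.4; /* vp9_inv_dct_dct_32x32_add_8_1_c    */
    const double sse2_time = 145.0;  /* vp9_inv_dct_dct_32x32_add_8_1_sse2 */
    printf("sse2 vs c: %.2fx\n", speedup(c_time, sse2_time));
}
```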

Re: [FFmpeg-devel] [PATCH] vp9: add 32x32 idct AVX2 implementation.

2016-07-19 Thread Ronald S. Bultje
Hi,

On Sat, Jul 16, 2016 at 5:55 AM, Henrik Gramner wrote:
> On Wed, Jul 13, 2016 at 6:37 PM, Ronald S. Bultje wrote:
> > +cglobal vp9_idct_idct_32x32_add, 4, 9, 16, 2048, dst, stride, block, eob
> [...]
> > +movd xm0, [blockq]
> > +

Re: [FFmpeg-devel] [PATCH] vp9: add 32x32 idct AVX2 implementation.

2016-07-16 Thread Henrik Gramner
On Wed, Jul 13, 2016 at 6:37 PM, Ronald S. Bultje wrote:
> +cglobal vp9_idct_idct_32x32_add, 4, 9, 16, 2048, dst, stride, block, eob
[...]
> +movd xm0, [blockq]
> +mova m1, [pw_11585x2]
> +pmulhrsw m0, m1
> +pmulhrsw
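
For readers unfamiliar with the instruction in the quoted assembly, the following is a rough scalar model of one 16-bit lane of PMULHRSW, which computes a rounded (a * b) >> 15. It is a sketch for illustration only, not code from the patch. The value 23170 for `pw_11585x2` is an assumption inferred from the constant's name (11585 * 2). The function and macro names are hypothetical, and the model relies on arithmetic right shift of negative values, which mainstream compilers provide.

```c
#include <stdint.h>

/* Assumed value of pw_11585x2, inferred from its name: 11585 * 2. */
#define PW_11585X2 23170

/* Hypothetical scalar model of one signed 16-bit lane of PMULHRSW:
 * multiply, add the rounding constant 1 << 14, keep bits 30..15. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;
    return (int16_t)((prod + 0x4000) >> 15);
}

/* Multiplying by 23170 / 32768 is the same as multiplying by
 * 11585 / 16384, roughly 0.7071 (about cos(pi/4)). */
static int16_t scale_by_11585(int16_t coef)
{
    return pmulhrsw_lane(coef, PW_11585X2);
}
```

Under this model, `scale_by_11585(100)` gives 71 and `scale_by_11585(-100)` gives -71. Doubling the constant is what lets a shift by 15 stand in for the 14-bit fixed-point scaling.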

Re: [FFmpeg-devel] [PATCH] vp9: add 32x32 idct AVX2 implementation.

2016-07-15 Thread Ronald S. Bultje
Hi,

On Wed, Jul 13, 2016 at 12:37 PM, Ronald S. Bultje wrote:
> About 1.8x speedup compared to AVX version for full IDCT. Other
> sub-IDCT scenarios also see speedups. Full --bench output for
> idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):
>
> nop: 16.5
>

[FFmpeg-devel] [PATCH] vp9: add 32x32 idct AVX2 implementation.

2016-07-13 Thread Ronald S. Bultje
About 1.8x speedup compared to AVX version for full IDCT. Other sub-IDCT
scenarios also see speedups. Full --bench output for
idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):

nop: 16.5
vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0