On 2/13/2017 9:44 AM, James Darnley wrote:
> x86-64 only
>
> Yorkfield:
>  - sse2: 2.16x (434 vs. 201 cycles)
>
> Skylake:
>  - sse2: 3.04x (378 vs. 124 cycles)
>  - avx:  3.29x (378 vs. 115 cycles)
> ---
>  libavcodec/x86/h264_deblock.asm | 119 ++++++++++++++++++++++++++++++++++++++++
>  libavcodec/x86/h264dsp_init.c   |  10 ++++
>  2 files changed, 129 insertions(+)
>
> diff --git a/libavcodec/x86/h264_deblock.asm b/libavcodec/x86/h264_deblock.asm
> index 509a0dbe0c..f47a199e8f 100644
> --- a/libavcodec/x86/h264_deblock.asm
> +++ b/libavcodec/x86/h264_deblock.asm
> @@ -377,10 +377,129 @@ cglobal deblock_h_luma_8, 5,9,0,0x60+16*WIN64
>      RET
>  %endmacro
>
> +; TODO: use macro arguments
> +%macro TRANSPOSE_8X8B_XMM 8
Why not put this in x86util? And using arguments, of course. Also, just
call it TRANSPOSE_8X8B.

> +    punpcklbw m0, m1
> +    punpcklbw m2, m3
> +    punpcklbw m4, m5
> +    punpcklbw m6, m7
> +
> +    punpckhwd m1, m0, m2
> +    punpcklwd m0, m2

Use SBUTTERFLY here and below.

> +
> +    punpckhwd m5, m4, m6
> +    punpcklwd m4, m6
> +
> +    punpckhdq m2, m0, m4
> +    punpckldq m0, m4
> +
> +    punpckhdq m6, m1, m5
> +    punpckldq m1, m5
> +
> +    MOVHL m4, m0
> +    MOVHL m3, m2
> +    MOVHL m7, m6
> +    MOVHL m5, m1
> +    SWAP 1, 4
> +%endmacro
> +
> +%macro TRANSPOSE_8X8B_XMM 0
> +    TRANSPOSE_8X8B_XMM 0, 1, 2, 3, 4, 5, 6, 7

This seems wrong, or at least superfluous.

> +%endmacro
> +
> +%macro DEBLOCK_H_LUMA_MBAFF 0
> +
> +cglobal deblock_h_luma_mbaff_8, 5, 9, 10, 8*16, pix_, stride_, alpha_, beta_, tc0_

Why the underscores?

> +    movsxd stride_q, stride_d
> +    dec    alpha_d
> +    dec    beta_d
> +    mov    r5, pix_q
> +    lea    r6, [3*stride_q]

Call r6 stride3.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
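The shuffle network in TRANSPOSE_8X8B_XMM is easier to check against a scalar model. The C below is a sketch, not FFmpeg code, and rests on these assumptions:

- The `xmm` struct and the `unpack`/`movhl` helpers are hypothetical stand-ins for the punpckl*/punpckh* instructions and the MOVHL macro.
- Row i of the 8x8 block has already been loaded into the low qword of m[i]. That load is not part of the quoted hunk.
- `SWAP 1, 4` is modelled as a plain exchange of array slots.

The code mirrors the quoted instruction order one step at a time, so each comment can be compared with the asm above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit register model: b[0] is the lowest byte. */
typedef struct { uint8_t b[16]; } xmm;

/* Models punpckl{bw,wd,dq} (high == 0) and punpckh{bw,wd,dq} (high == 1):
 * interleave size-byte elements from the low or high qword of a and b. */
static xmm unpack(xmm a, xmm b, int size, int high)
{
    assert(size == 1 || size == 2 || size == 4);
    xmm r;
    int n    = 8 / size;       /* elements taken from each source qword */
    int base = high ? 8 : 0;   /* byte offset of the source qword */
    for (int i = 0; i < n; i++) {
        memcpy(&r.b[(2 * i) * size],     &a.b[base + i * size], size);
        memcpy(&r.b[(2 * i + 1) * size], &b.b[base + i * size], size);
    }
    return r;
}

/* Models MOVHL dst, src: the high qword of src lands in the low qword of the
 * result. Only the low qword is read afterwards, so the rest is zeroed. */
static xmm movhl(xmm src)
{
    xmm r;
    memset(&r, 0, sizeof(r));
    memcpy(r.b, &src.b[8], 8);
    return r;
}

/* Transpose a row-major 8x8 byte block in the same order as the quoted macro.
 * Assumption: row i starts in the low qword of m[i]. */
static void transpose_8x8b(const uint8_t *in, uint8_t *out)
{
    xmm m[8];
    memset(m, 0, sizeof(m));
    for (int i = 0; i < 8; i++)
        memcpy(m[i].b, &in[8 * i], 8);

    /* punpcklbw: pair rows (0,1), (2,3), (4,5), (6,7) byte by byte */
    m[0] = unpack(m[0], m[1], 1, 0);
    m[2] = unpack(m[2], m[3], 1, 0);
    m[4] = unpack(m[4], m[5], 1, 0);
    m[6] = unpack(m[6], m[7], 1, 0);

    /* punpck{h,l}wd: 4-byte column fragments from rows 0-3 and rows 4-7 */
    m[1] = unpack(m[0], m[2], 2, 1);
    m[0] = unpack(m[0], m[2], 2, 0);
    m[5] = unpack(m[4], m[6], 2, 1);
    m[4] = unpack(m[4], m[6], 2, 0);

    /* punpck{h,l}dq: each register now holds two complete columns */
    m[2] = unpack(m[0], m[4], 4, 1);   /* columns 2 and 3 */
    m[0] = unpack(m[0], m[4], 4, 0);   /* columns 0 and 1 */
    m[6] = unpack(m[1], m[5], 4, 1);   /* columns 6 and 7 */
    m[1] = unpack(m[1], m[5], 4, 0);   /* columns 4 and 5 */

    /* MOVHL: move the odd columns into their own registers */
    m[4] = movhl(m[0]);
    m[3] = movhl(m[2]);
    m[7] = movhl(m[6]);
    m[5] = movhl(m[1]);

    /* SWAP 1, 4 only renames registers in x86inc; no data moves in the asm */
    xmm t = m[1]; m[1] = m[4]; m[4] = t;

    for (int i = 0; i < 8; i++)
        memcpy(&out[8 * i], m[i].b, 8);  /* low qword of m[i] = column i */
}
```

Every stage after the first is a matched punpckh/punpckl pair on the same two inputs. That pairing is what the SBUTTERFLY suggestion refers to. The first stage needs only the low interleave because each row occupies just the low qword.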