Re: [FFmpeg-devel] [PATCH] libavcodec Adding ff_v210_planar_unpack AVX2
Hello,

I’ve accounted for all feedback on this so far, and I’m wondering if it is ready to be pushed upstream.

Here are my results from ‘checkasm’ (lower is better):

v210_unpack_c: 1636
v210_unpack_ssse3: 611
v210_unpack_avx: 601
v210_unpack_avx2: 423

I ran it 5 times and, for each CPU target, averaged the middle 3 results (ignoring the highest and lowest times).

https://patchwork.ffmpeg.org/patch/12325/

Thanks…
-Mike
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH] libavcodec Adding ff_v210_planar_unpack AVX2
Hello,

I resent my AVX2 patch for v210 unpacking. My first attempt didn't get picked up by the Patchwork list for some reason.

I installed Linux on a Broadwell laptop to utilize James Darnley's checkasm patch for v210 decode. The results are below.

AVX2 gets a nice boost from replacing SHUFPS instructions with VPBLENDD, which has more flexible port bindings. VBLENDPS could also be substituted, and its non-VEX form BLENDPS is available from SSE4.1 onward; however, I found that only the AVX2 code received any measurable gain from that change.

Any further comments are greatly appreciated.

Thanks,
Mike

Tested on Broadwell CPU, Ubuntu 18.10 x86_64

~/FFmpeg$ tests/checkasm/checkasm --bench --test=v210dec
benchmarking with native FFmpeg timers
nop: 94.1
checkasm: using random seed 3963743306
SSSE3:
 - v210dec.v210_unpack [OK]
AVX:
 - v210dec.v210_unpack [OK]
AVX2:
 - v210dec.v210_unpack [OK]
checkasm: all 3 tests passed
v210_unpack_c: 1625.2
v210_unpack_ssse3: 604.2
v210_unpack_avx: 592.2
v210_unpack_avx2: 422.2
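To put the size of the AVX2 gain in concrete terms, the checkasm figures quoted in this message can be turned into speedup ratios. The following is only a small arithmetic sketch; the `timings` dictionary and the `speedup` helper are illustrative names, not part of checkasm or FFmpeg.

```python
# Timer values copied from the checkasm run quoted in this message
# (lower is better).
timings = {
    "c": 1625.2,
    "ssse3": 604.2,
    "avx": 592.2,
    "avx2": 422.2,
}


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return timings[baseline] / timings[candidate]


if __name__ == "__main__":
    for name in ("ssse3", "avx", "avx2"):
        print(f"{name}: {speedup('c', name):.2f}x over C")
    print(f"avx2: {speedup('avx', 'avx2'):.2f}x over AVX")
```

With these numbers the AVX2 version comes out at roughly 3.85x the C version and about 1.40x the AVX version, while SSSE3 and AVX are within a few percent of each other.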
Re: [FFmpeg-devel] [PATCH] Revised ff_v210_planar_unpack AVX2
I am submitting another patch. Please disregard this one.

-Mike
Re: [FFmpeg-devel] [PATCH] Added ff_v210_planar_unpack_aligned_avx2
Thanks for the feedback.

You are right, I can use VPERMQ to free up a register. I can also remove the PAND mask by doing PSLLD/PSRLD. That eliminates the need for an x86-64 block.

I tried the naive 'unrolled' version with no permute, and it was much slower, about the same as the AVX/SSSE3 code. VPERMQ/D is a single shuffle uop on port 5, so it turns out to be useful.

I will submit a new patch with those improvements and the VBROADCASTI128 macro. I modeled my code on 'v210enc.asm', which could also be updated to use VBROADCASTI128.

Note: I'm running on Windows, and it looks like 'checkasm' performance benchmarking is only enabled on Linux. For my tests I put a 100x loop around the 'unpack_frame' call and ran:

ffmpeg.exe -s:v 1920x1080 -vcodec v210 -stream_loop 200 -i OddaView_1920x1080.v210 -f null -y NUL

If there is a better way, let me know...

Thanks,
Mike
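For readers following the PAND versus PSLLD/PSRLD point, here is a scalar sketch of the idea. It is not the code from the patch: it assumes the shift pair is used to isolate the middle 10-bit component of each 32-bit v210 word, and it emulates a 32-bit SIMD lane with plain Python integers. The helper names (`middle_with_mask`, `middle_with_shifts`, `unpack_group`) are hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF  # emulate one 32-bit lane of a SIMD register


def middle_with_mask(word):
    """Extract bits 10..19 with a shift plus an AND mask (PSRLD + PAND style)."""
    return (word >> 10) & 0x3FF


def middle_with_shifts(word):
    """Extract bits 10..19 with two shifts and no mask constant.

    Shifting left by 12 pushes bits 20..31 out of the 32-bit lane; the
    logical shift right by 22 then drops bits 0..9 and leaves the middle
    component in the low 10 bits.
    """
    return ((word << 12) & MASK32) >> 22


def unpack_group(block16):
    """Unpack one 16-byte v210 group (6 pixels) into planar Y, U and V lists.

    v210 stores three 10-bit components per little-endian 32-bit word, in
    the order (Cb0 Y0 Cr0) (Y1 Cb1 Y2) (Cr1 Y3 Cb2) (Y4 Cr2 Y5).
    """
    words = struct.unpack("<4I", block16)
    c = [(w & 0x3FF, middle_with_shifts(w), (w >> 20) & 0x3FF) for w in words]
    y = [c[0][1], c[1][0], c[1][2], c[2][1], c[3][0], c[3][2]]
    u = [c[0][0], c[1][1], c[2][2]]
    v = [c[0][2], c[2][0], c[3][1]]
    return y, u, v
```

Both helpers return the same value for any 32-bit input, including words whose two unused top bits are set, which is why the shift pair can stand in for the mask without needing an extra register for the constant.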