On Mon, Oct 19, 2015 at 10:00 PM, Timothy Gu <timothyg...@gmail.com> wrote:
> About 16% faster on large clips (>1200px width), more than 2x slower on small
> clips (352px).
The reason for this is likely that you fall back to scalar as soon as you have less than 2*mmsize bytes left to process, which means a larger portion is done in scalar code as the vector size grows.

A possible workaround is to gradually decrease the amount you process with SIMD as you approach the end, e.g. fall back to xmm registers, then half of an xmm register, and maybe even a quarter of an xmm register (as always, benchmark to see what helps) before doing scalar for the last few bytes.

This assumes that you cannot overread src and/or overwrite dst; if you're allowed to do that, then it's a bit easier of course.
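To make the step-down idea concrete, here is a rough sketch in portable C. It is only an illustration of the loop structure, not the code from the patch: `op`, `process_block` and `process_row` are hypothetical names, the per-byte inversion is a placeholder for the real filter, and each fixed-width block stands in for one SIMD operation of that width (32 bytes for ymm, 16 for xmm, 8 for half an xmm, 4 for a quarter).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-byte operation, a placeholder for the real filter. */
static inline uint8_t op(uint8_t v)
{
    return (uint8_t)(255 - v);
}

/* Processes exactly `width` bytes. In real code this would be a single
 * SIMD operation of that width (assumption: 32 = ymm, 16 = xmm,
 * 8 = half xmm, 4 = quarter xmm). */
static inline void process_block(uint8_t *dst, const uint8_t *src, size_t width)
{
    for (size_t i = 0; i < width; i++)
        dst[i] = op(src[i]);
}

/* Processes n bytes without reading or writing past the end of the
 * buffers. Returns how many bytes were left for the scalar tail. */
size_t process_row(uint8_t *dst, const uint8_t *src, size_t n)
{
    static const size_t widths[] = { 32, 16, 8, 4 };
    size_t i = 0;

    /* Widest blocks first, then step down as the remainder shrinks. */
    for (size_t w = 0; w < sizeof(widths) / sizeof(widths[0]); w++) {
        while (n - i >= widths[w]) {
            process_block(dst + i, src + i, widths[w]);
            i += widths[w];
        }
    }

    /* Scalar code only sees what is left over, at most 3 bytes here. */
    size_t scalar_bytes = n - i;
    for (; i < n; i++)
        dst[i] = op(src[i]);

    return scalar_bytes;
}
```

With this layout the scalar loop handles at most 3 bytes per call, whereas falling back to scalar below 2*mmsize with 32-byte registers can leave up to 63 bytes for it. Whether the narrower steps actually pay off depends on the filter, so benchmark each one.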