Hi guys, more generic question/proposal:
On Mon, Aug 22, 2011 at 2:25 PM, Loren Merritt <lor...@u.washington.edu> wrote:
> On Mon, 22 Aug 2011, Ronald S. Bultje wrote:
>> On Mon, Aug 22, 2011 at 4:17 AM, Loren Merritt <lor...@u.washington.edu>
>> wrote:
>>> On Sun, 21 Aug 2011, Ronald S. Bultje wrote:
>>
>>>> hscale10to15_X8_ssse3:
>>>> ...
>>>> pshufd m4, m4, 11011000b
>>>> movhlps m0, m4
>>>> paddd m0, m4
>>>
>>> phaddd is slightly faster than this on conroe and penryn (which surprised
>>> me). Equal speed on sandybridge.
>>> 2x pshufd (which can execute in parallel) is faster than pshufd,movhlps in
>>> series, but slower than phaddd. Dunno if this is also true on the cpus
>>> that would use non-ssse3 though: k10 might agree, k8 might not, but those
>>> guesses aren't based on a benchmark.
>>
>> If phaddd is faster on some, plus smaller, we can safely use that.
>>
>> I'm surprised phaddd is faster though, I specifically changed this
>> (without measuring, indeed), because Jason said phaddd is always
>> slower unless you use both halves of the dst reg. But as said, if it's
>> faster, I'll change it back.
>
> That's the heuristic that I use too. But the context I learned it in is
> where a horizontal sum step usually consists of 1 shuffle, not 2.
>
>>>> anything with 8bit input
>>>
>>> When differences between successive filterPos are small (i.e. downscaling
>>> by not too much, or any upscaling), then a single load instruction could
>>> cover all the input samples corresponding to several outputs, and then
>>> pshufb them into place. Might be cache-unfriendly, though: shuffle
>>> constants are larger than filterPos.
>>>
>>> Remove redundant loads of filterPos by processing more than 1 row at once?
>>
>> Wouldn't require branches to see if filterPos is equal to the previous
>> one? Or would you like me to have multiple scaling variants, one for
>> "downscaling by not too much" which assumes filterPos is mostly equal,
>> and one for the general case as it is now? That sounds icky. I don't
>> really see how I'd do this in practice.
>
> This is two separate proposals. In the first proposal, such branches would
> be uniform (distance between successive filterPos is determined solely by
> scaling ratio), so can be factored out of the loop, resulting in multiple
> scaling variants.
>
> In the second proposal, filterPos values are always equal for the same
> horizontal position in different rows, no branches necessary.

I'd like to delay these until after this patch is applied. Does anyone
mind if I apply the patch as-is (with the phaddd change as suggested
by Loren) so others can continue from there? I have little time ATM
and it'd suck to see this patch just be left in the bushes.

Ronald
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
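
For anyone following the horizontal-sum part of the thread: below is a minimal scalar sketch in C, not code from the patch, that models the two instruction sequences lane by lane and shows that they leave the same values in the low two dwords. The type `xmm_t` and every `model_*`, `hsum_*`, `make_xmm` and `lane` name are hypothetical helpers made up for illustration. The model assumes the two-operand form where `paddd m0, m4` means `m0 += m4`, and it uses unsigned lanes so that wraparound is well defined in C. It says nothing about speed, which is what the benchmarks quoted above are about.

```c
#include <stdint.h>

/* Scalar model of one 128-bit register holding four 32-bit lanes. */
typedef struct { uint32_t d[4]; } xmm_t;

static xmm_t make_xmm(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    xmm_t r = { { a, b, c, d } };
    return r;
}

/* Read one lane; takes the register by value so callers can pass temporaries. */
static uint32_t lane(xmm_t r, int i)
{
    return r.d[i];
}

/* pshufd dst, src, imm8: lane i of dst is lane ((imm8 >> 2*i) & 3) of src. */
static xmm_t model_pshufd(xmm_t src, unsigned imm8)
{
    xmm_t dst;
    for (int i = 0; i < 4; i++)
        dst.d[i] = src.d[(imm8 >> (2 * i)) & 3];
    return dst;
}

/* movhlps dst, src: low 64 bits of dst = high 64 bits of src; high half of dst is kept. */
static xmm_t model_movhlps(xmm_t dst, xmm_t src)
{
    dst.d[0] = src.d[2];
    dst.d[1] = src.d[3];
    return dst;
}

/* paddd dst, src: lane-wise 32-bit add (wraps modulo 2^32). */
static xmm_t model_paddd(xmm_t dst, xmm_t src)
{
    for (int i = 0; i < 4; i++)
        dst.d[i] += src.d[i];
    return dst;
}

/* phaddd dst, src: pairwise sums of dst in the low half, of src in the high half. */
static xmm_t model_phaddd(xmm_t dst, xmm_t src)
{
    return make_xmm(dst.d[0] + dst.d[1], dst.d[2] + dst.d[3],
                    src.d[0] + src.d[1], src.d[2] + src.d[3]);
}

/* Sequence quoted from the patch:
 *   pshufd m4, m4, 11011000b ; movhlps m0, m4 ; paddd m0, m4
 * Only the low two lanes of the result are meaningful; the high lanes
 * depend on whatever m0 held before, which is modeled here as zero. */
static xmm_t hsum_shuffle(xmm_t m4)
{
    xmm_t m0 = make_xmm(0, 0, 0, 0);
    m4 = model_pshufd(m4, 0xD8);        /* 11011000b -> lanes [0, 2, 1, 3] */
    m0 = model_movhlps(m0, m4);
    return model_paddd(m0, m4);
}

/* Suggested alternative: phaddd m4, m4 */
static xmm_t hsum_phaddd(xmm_t m4)
{
    return model_phaddd(m4, m4);
}

/* Returns 1 if both sequences agree on the two meaningful (low) lanes. */
static int hsum_low_lanes_match(xmm_t in)
{
    xmm_t a = hsum_shuffle(in);
    xmm_t b = hsum_phaddd(in);
    return a.d[0] == b.d[0] && a.d[1] == b.d[1];
}
```

Calling `hsum_low_lanes_match(make_xmm(1, 2, 3, 4))` compares the two models on one input. In both cases the low lanes hold the sums of lanes 0+1 and 2+3 of the original register.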