Re: [libav-devel] [PATCH] sws: implement MMX/SSE2/SSSE3/SSE4 versions for horizontal scaling.

Ronald S. Bultje Mon, 22 Aug 2011 09:15:01 -0700

Hi,

On Mon, Aug 22, 2011 at 4:17 AM, Loren Merritt <lor...@u.washington.edu> wrote:
> On Sun, 21 Aug 2011, Ronald S. Bultje wrote:
>
>> hscale10to15_8_ssse3:
>> ...
>> phaddd  m0, m1
>> phaddd  m4, m5
>> phaddd  m0, m4
>
> If you load with movq/movhps and rearrange filter[] accordingly, then 2 of
> those can become paddd. 13% faster on conroe, 3% faster on penryn, slower
> on sandybridge. (I didn't try to do the rearrange, just ran it with wrong
> coefs.)


I can transpose or interleave (whatever you call this) the coefficients
during init in "align-size" blocks, i.e. 4 coefficients for dstpixel 1,
then 4 coefficients for dstpixel 2, then the next 4 coeffs for
dstpixel 1, the next 4 coeffs for dstpixel 2 [etc], followed by similar
interleaved blocks of 4 coeffs for dstpixels 3/4, and so on for the
rest of the row. The loop then becomes a set of pmaddwd, 1 phaddd and
a series of paddd, which may be faster.
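
To make the layout concrete, here is a rough scalar sketch in C of the
rearrangement described above. It is not the patch itself: the names
interleave_filter_pairs and hscale_pair_model are hypothetical, and it
assumes an even dstW, a filterSize that is a multiple of 4, and 16-bit
input samples. The second function models what the SIMD loop would
compute per pair of output pixels (pmaddwd, then paddd across blocks,
then one final phaddd).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical init-time helper: for each pair of dst pixels, emit
 * 4 coeffs of the first pixel, 4 coeffs of the second, and repeat. */
static void interleave_filter_pairs(int16_t *dst, const int16_t *src,
                                    int dst_w, int filter_size)
{
    assert(dst_w % 2 == 0 && filter_size % 4 == 0);
    for (int p = 0; p < dst_w / 2; p++) {
        const int16_t *a = src + (2 * p)     * filter_size;
        const int16_t *b = src + (2 * p + 1) * filter_size;
        int16_t *out     = dst + (2 * p)     * filter_size;
        for (int k = 0; k < filter_size / 4; k++) {
            for (int j = 0; j < 4; j++) {
                out[8 * k + j]     = a[4 * k + j];
                out[8 * k + 4 + j] = b[4 * k + j];
            }
        }
    }
}

/* Scalar model of one output pair using the interleaved coefficients.
 * px_a/px_b point at src + filterPos[] for the two dst pixels. */
static void hscale_pair_model(int32_t out[2], const int16_t *px_a,
                              const int16_t *px_b, const int16_t *coef,
                              int filter_size)
{
    int32_t acc[4] = { 0, 0, 0, 0 };
    for (int k = 0; k < filter_size / 4; k++) {
        const int16_t *c = coef + 8 * k;
        /* pmaddwd: 8 word products summed pairwise into 4 dwords;
         * paddd: accumulate those dwords across blocks. */
        acc[0] += px_a[4 * k + 0] * c[0] + px_a[4 * k + 1] * c[1];
        acc[1] += px_a[4 * k + 2] * c[2] + px_a[4 * k + 3] * c[3];
        acc[2] += px_b[4 * k + 0] * c[4] + px_b[4 * k + 1] * c[5];
        acc[3] += px_b[4 * k + 2] * c[6] + px_b[4 * k + 3] * c[7];
    }
    /* The single remaining phaddd: one horizontal add per pixel. */
    out[0] = acc[0] + acc[1];
    out[1] = acc[2] + acc[3];
}
```

In this model the low four lanes of each block belong to the first
pixel and the high four to the second, which is what a movq/movhps
load pair would produce.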

>> hscale10to15_X8_ssse3:
>> ...
>> pshufd  m4, m4, 11011000b
>> movhlps m0, m4
>> paddd   m0, m4
>
> phaddd is slightly faster than this on conroe and penryn (which surprised
> me). Equal speed on sandybridge.
> 2x pshufd (which can execute in parallel) is faster than pshufd,movhlps in
> series, but slower than phaddd. Dunno if this is also true on the cpus
> that would use non-ssse3 though: k10 might agree, k8 might not, but those
> guesses aren't based on a benchmark.

If phaddd is faster on some CPUs, and smaller too, we can safely use it.

I'm surprised phaddd is faster, though. I specifically changed this
(without measuring, indeed) because Jason said phaddd is always
slower unless you use both halves of the dst reg. But as I said, if
it's faster, I'll change it back.
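
For reference, the two reductions being compared give the same low two
dwords. The C sketch below is only a scalar model of the instructions
(the xmm_t type and the function names are made up for illustration);
it says nothing about speed, which is the actual question here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an XMM register as 4 x 32-bit lanes. */
typedef struct { uint32_t d[4]; } xmm_t;

/* pshufd: each 2-bit field of imm selects a source lane. */
static xmm_t pshufd(xmm_t s, unsigned imm)
{
    xmm_t r;
    assert(imm <= 0xFF);
    for (int i = 0; i < 4; i++)
        r.d[i] = s.d[(imm >> (2 * i)) & 3];
    return r;
}

/* movhlps: high 64 bits of src go to low 64 bits of dst. */
static xmm_t movhlps(xmm_t dst, xmm_t src)
{
    dst.d[0] = src.d[2];
    dst.d[1] = src.d[3];
    return dst;
}

/* paddd: lane-wise 32-bit add (wraps on overflow). */
static xmm_t paddd(xmm_t a, xmm_t b)
{
    for (int i = 0; i < 4; i++)
        a.d[i] += b.d[i];
    return a;
}

/* phaddd: adjacent pairs of a fill the low half, pairs of b the high. */
static xmm_t phaddd(xmm_t a, xmm_t b)
{
    xmm_t r;
    r.d[0] = a.d[0] + a.d[1];
    r.d[1] = a.d[2] + a.d[3];
    r.d[2] = b.d[0] + b.d[1];
    r.d[3] = b.d[2] + b.d[3];
    return r;
}

/* pshufd m4, m4, 11011000b / movhlps m0, m4 / paddd m0, m4 */
static void reduce_shuffle(uint32_t out[2], xmm_t m4)
{
    xmm_t m0 = { { 0, 0, 0, 0 } };
    m4 = pshufd(m4, 0xD8);      /* lanes become x0, x2, x1, x3 */
    m0 = movhlps(m0, m4);       /* low lanes become x1, x3 */
    m0 = paddd(m0, m4);         /* low lanes become x0+x1, x2+x3 */
    out[0] = m0.d[0];
    out[1] = m0.d[1];
}

/* phaddd m4, m4: only the low half of the result is used. */
static void reduce_phaddd(uint32_t out[2], xmm_t m4)
{
    xmm_t r = phaddd(m4, m4);
    out[0] = r.d[0];
    out[1] = r.d[1];
}
```

Only the low two lanes are compared; in the shuffle version the upper
lanes of m0 hold leftovers and are ignored.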

>> anything with 8bit input
>
> When differences between successive filterPos are small (i.e. downscaling
> by not too much, or any upscaling), then a single load instruction could
> cover all the input samples corresponding to several outputs, and then
> pshufb them into place. Might be cache-unfriendly, though: shuffle
> constants are larger than filterPos.
>
> Remove redundant loads of filterPos by processing more than 1 row at once?

Wouldn't that require branches to check whether filterPos is equal to
the previous one? Or would you like me to have multiple scaling
variants, one for "downscaling by not too much" which assumes filterPos
is mostly equal, and one for the general case as it is now? That sounds
icky. I don't really see how I'd do this in practice.
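
One way the single-load + pshufb idea could avoid per-pixel branches is
to decide at init time, per group of outputs, whether all taps fit in
one 16-byte window, and to precompute the shuffle constants then. The
C below is only a sketch of that assumption for 8-bit input: the
function names are hypothetical, filterPos is taken as plain int, and
nothing here is from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical init-time helper. Returns 1 and fills mask[] if every
 * tap of n_out consecutive outputs lies within the 16 bytes starting
 * at filter_pos[0]; returns 0 otherwise (mask[] is then unspecified). */
static int build_shuffle_mask(uint8_t mask[16], const int *filter_pos,
                              int n_out, int filter_size)
{
    assert(n_out > 0 && filter_size > 0);
    int base = filter_pos[0];
    if (n_out * filter_size > 16)
        return 0;
    memset(mask, 0x80, 16);     /* 0x80 makes pshufb zero a lane */
    for (int i = 0; i < n_out; i++) {
        for (int j = 0; j < filter_size; j++) {
            int off = filter_pos[i] + j - base;
            if (off < 0 || off > 15)
                return 0;       /* needs the general code path */
            mask[i * filter_size + j] = (uint8_t)off;
        }
    }
    return 1;
}

/* Scalar model of pshufb on one 16-byte register. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 15];
}
```

Each group needs 16 bytes of shuffle constants, against a few filterPos
entries in the current scheme, which is the cache concern raised above.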

Ronald
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel

Re: [libav-devel] [PATCH] sws: implement MMX/SSE2/SSSE3/SSE4 versions for horizontal scaling.
