Hi guys,

A more generic question/proposal:

On Mon, Aug 22, 2011 at 2:25 PM, Loren Merritt <lor...@u.washington.edu> wrote:
> On Mon, 22 Aug 2011, Ronald S. Bultje wrote:
>> On Mon, Aug 22, 2011 at 4:17 AM, Loren Merritt <lor...@u.washington.edu> wrote:
>>> On Sun, 21 Aug 2011, Ronald S. Bultje wrote:
>>
>>>> hscale10to15_X8_ssse3:
>>>> ...
>>>> pshufd  m4, m4, 11011000b
>>>> movhlps m0, m4
>>>> paddd   m0, m4
>>>
>>> phaddd is slightly faster than this on conroe and penryn (which surprised
>>> me). Equal speed on sandybridge.
>>> 2x pshufd (which can execute in parallel) is faster than pshufd,movhlps in
>>> series, but slower than phaddd. Dunno if this is also true on the cpus
>>> that would use non-ssse3 though: k10 might agree, k8 might not, but those
>>> guesses aren't based on a benchmark.
>>
>> If phaddd is faster on some, plus smaller, we can safely use that.
>>
>> I'm surprised phaddd is faster though, I specifically changed this
>> (without measuring, indeed), because Jason said phaddd is always
>> slower unless you use both halves of the dst reg. But as said, if it's
>> faster, I'll change it back.
>
> That's the heuristic that I use too. But the context I learned it in is
> where a horizontal sum step usually consists of 1 shuffle, not 2.
>
>>>> anything with 8bit input
>>>
>>> When differences between successive filterPos are small (i.e. downscaling
>>> by not too much, or any upscaling), then a single load instruction could
>>> cover all the input samples corresponding to several outputs, and then
>>> pshufb them into place. Might be cache-unfriendly, though: shuffle
>>> constants are larger than filterPos.
>>>
>>> Remove redundant loads of filterPos by processing more than 1 row at once?
>>
>> Wouldn't that require branches to see if filterPos is equal to the previous
>> one? Or would you like me to have multiple scaling variants, one for
>> "downscaling by not too much" which assumes filterPos is mostly equal,
>> and one for the general case as it is now? That sounds icky. I don't
>> really see how I'd do this in practice.
>
> This is two separate proposals. In the first proposal, such branches would
> be uniform (distance between successive filterPos is determined solely by
> scaling ratio), so can be factored out of the loop, resulting in multiple
> scaling variants.
>
> In the second proposal, filterPos values are always equal for the same
> horizontal position in different rows, no branches necessary.
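
A minimal scalar sketch of the second proposal, for illustration only. The
function name hscale_two_rows, its signature and the int16_t filterPos type
are assumptions made for this example, not the actual swscale interface.
Each filterPos entry and each coefficient is read once per output column
and reused for both rows, so no branch on filterPos is needed:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not the swscale API): horizontally scales two rows in
 * one pass. filterPos[i] and the coefficients for column i are loaded once
 * and shared by both rows, because they depend only on the column. */
static void hscale_two_rows(int32_t *dst0, int32_t *dst1, int dstW,
                            const uint8_t *src0, const uint8_t *src1,
                            const int16_t *filter, const int16_t *filterPos,
                            int filterSize)
{
    assert(filterSize > 0);
    for (int i = 0; i < dstW; i++) {
        int pos = filterPos[i];                     /* single load per column */
        const int16_t *f = filter + i * filterSize; /* coefficients, column i */
        int32_t sum0 = 0, sum1 = 0;
        for (int j = 0; j < filterSize; j++) {
            sum0 += src0[pos + j] * f[j];           /* row 0 */
            sum1 += src1[pos + j] * f[j];           /* row 1, same pos and f */
        }
        dst0[i] = sum0;
        dst1[i] = sum1;
    }
}
```

Whether this pays off in the SIMD version presumably depends on register
pressure and on how the caller hands out rows; the sketch only shows where
the redundant loads go away.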

I'd like to delay these until after this patch is applied. Does anyone
mind if I apply the patch as-is (with the phaddd change as suggested
by Loren) so others can continue from there? I have little time ATM
and it'd suck to see this patch just be left in the bushes.
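
For reference, a small scalar C model of why the phaddd change should be a
drop-in replacement for the quoted pshufd/movhlps/paddd sequence. This is a
sketch, assuming only the low two dwords of the result are used afterwards
(the quoted snippet does not show that); xmm_t and the helper functions are
made up for the example and only mimic the lane behaviour of the
instructions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of a 128-bit register as four dwords; d[0] is lowest.
 * Unsigned lanes so that additions wrap like paddd/phaddd do. */
typedef struct { uint32_t d[4]; } xmm_t;

/* pshufd dst, src, imm8: dst.d[i] = src.d[(imm8 >> 2*i) & 3] */
static xmm_t pshufd(xmm_t src, unsigned imm8)
{
    xmm_t r;
    assert(imm8 <= 0xFF);
    for (int i = 0; i < 4; i++)
        r.d[i] = src.d[(imm8 >> (2 * i)) & 3];
    return r;
}

/* movhlps dst, src: low half of dst = high half of src, high half kept */
static xmm_t movhlps(xmm_t dst, xmm_t src)
{
    dst.d[0] = src.d[2];
    dst.d[1] = src.d[3];
    return dst;
}

/* paddd dst, src: lane-wise 32-bit add */
static xmm_t paddd(xmm_t dst, xmm_t src)
{
    for (int i = 0; i < 4; i++)
        dst.d[i] += src.d[i];
    return dst;
}

/* phaddd dst, src: [dst0+dst1, dst2+dst3, src0+src1, src2+src3] */
static xmm_t phaddd(xmm_t dst, xmm_t src)
{
    xmm_t r = {{ dst.d[0] + dst.d[1], dst.d[2] + dst.d[3],
                 src.d[0] + src.d[1], src.d[2] + src.d[3] }};
    return r;
}

/* The sequence from the patch: pshufd m4, m4, 11011000b;
 * movhlps m0, m4; paddd m0, m4. m0's old contents only reach the high half. */
static xmm_t hsum_shuffle(xmm_t m4, xmm_t m0)
{
    m4 = pshufd(m4, 0xD8);      /* [a0, a2, a1, a3] */
    m0 = movhlps(m0, m4);       /* low half = [a1, a3] */
    return paddd(m0, m4);       /* low half = [a0+a1, a2+a3] */
}

/* The suggested replacement: phaddd m4, m4. */
static xmm_t hsum_phaddd(xmm_t m4)
{
    return phaddd(m4, m4);      /* [a0+a1, a2+a3, a0+a1, a2+a3] */
}
```

Both variants leave a0+a1 and a2+a3 in the low two dwords; they differ only
in the high half, which the shuffle variant fills with leftovers from m0.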

Ronald
