On 06/20/2011 07:08 AM, Loren Merritt wrote:

> On Sun, 19 Jun 2011, Justin Ruggles wrote:
>> On 06/19/2011 04:46 AM, Loren Merritt wrote:
>>
>>> I included both "interleaved copies" and "consecutive copies" in
>>> "unrolling", under the prediction that they'd have the same effect. I was
>>> wrong, but I still don't know why. 4x should be plenty to max out ILP
>>> (regardless of whether it gets done by manual interleaving or instruction
>>> reordering), so I don't see what 8x unroll could possibly do other than
>>> have a tiny effect on loop overhead and increase code size.
>>>
>>> That said, I can reproduce your result on sandybridge.
>>
>> sandybridge:
> [...]
>> 947 - SSE4.1
>> 907 - SSE4.1 (unroll 8x consecutive)
>> 1094 - SSE4.1 (unroll 8x interleaved)
> 
> Did you get that backwards? Interleaved is the one that uses more xmmregs,
> consecutive is the one that can be done with %rep.

Oh, yeah, I did get that backwards then. read/process/write/read/process/write/loop
just sounded more "interleaved" to me...
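To pin down the terminology, here is a minimal scalar C sketch of the two unrolling styles as Loren defines them. It is only an illustration, not the actual asm: the function names, the 4x unroll factor and the scalar clip helper are assumptions made for the example, and len is assumed to be a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar helper standing in for the SIMD clip operation. */
static inline int32_t clip1(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* "Consecutive": the same read/process/write body repeated back to back,
 * which is what %rep would produce. Only one temporary is live at a time.
 * Assumes len is a multiple of 4. */
void clip_consecutive(int32_t *dst, const int32_t *src,
                      int32_t lo, int32_t hi, size_t len)
{
    for (size_t i = 0; i < len; i += 4) {
        dst[i + 0] = clip1(src[i + 0], lo, hi);
        dst[i + 1] = clip1(src[i + 1], lo, hi);
        dst[i + 2] = clip1(src[i + 2], lo, hi);
        dst[i + 3] = clip1(src[i + 3], lo, hi);
    }
}

/* "Interleaved": all reads, then all processing, then all writes.
 * Four temporaries are live at once, which in the SIMD version
 * corresponds to using more xmm registers.
 * Assumes len is a multiple of 4. */
void clip_interleaved(int32_t *dst, const int32_t *src,
                      int32_t lo, int32_t hi, size_t len)
{
    for (size_t i = 0; i < len; i += 4) {
        int32_t a = src[i + 0];
        int32_t b = src[i + 1];
        int32_t c = src[i + 2];
        int32_t d = src[i + 3];

        a = clip1(a, lo, hi);
        b = clip1(b, lo, hi);
        c = clip1(c, lo, hi);
        d = clip1(d, lo, hi);

        dst[i + 0] = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
}
```

Both functions compute the same result; the difference is only in instruction order and register pressure, which is what the benchmark numbers in this thread are comparing.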

>> - Clipping by doing int2float/clip/float2int only benefits athlon64. The
>> integer-only version is insanely faster on Atom. Are there other CPUs
>> that might benefit from the float version?
> 
> conroe:
> 6252 mmx
> 2855 sse2 float, 4x
> 2844 sse2 float, 8x interleaved
> 2804 sse2 float, 8x consecutive
> 3230 sse2 int, 4x
> 3044 sse2 int, 8x interleaved
> 3200 sse2 int, 8x consecutive


Well darn. This makes the decision more complicated.

How about:
sse2  float, 8x consecutive - default
sse2  float, 4x - sse2slow (athlon64)
sse2  int,   4x - atom
sse41        8x interleaved

Alternatively, we could leave out the float 4x version, which would make
athlon64 about 1% slower.
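For illustration, the selection proposed above could be sketched in C roughly as follows. This is a hedged sketch only: the flag names, the pick_clip_version function and the fallback string are hypothetical and are not the real libav init code; the function returns a label instead of a function pointer to keep the example self-contained.

```c
/* Hypothetical CPU capability flags, invented for this example. */
enum {
    CPU_SSE2     = 1 << 0,
    CPU_SSE2SLOW = 1 << 1,  /* has SSE2, but it is slow (e.g. athlon64) */
    CPU_ATOM     = 1 << 2,
    CPU_SSE41    = 1 << 3
};

/* Returns a label for the version that would be selected.
 * Later checks override earlier ones, so the most specific match wins. */
const char *pick_clip_version(unsigned flags)
{
    /* Fallback when SSE2 is absent; not part of the proposal (assumption). */
    const char *version = "fallback";

    if (flags & CPU_SSE2) {
        version = "sse2 float, 8x consecutive";  /* default */
        if (flags & CPU_SSE2SLOW)
            version = "sse2 float, 4x";
        if (flags & CPU_ATOM)
            version = "sse2 int, 4x";
    }
    if (flags & CPU_SSE41)
        version = "sse41, 8x interleaved";

    return version;
}
```

Dropping the float 4x version would amount to removing the CPU_SSE2SLOW branch, so athlon64 would fall through to the default.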

-Justin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
