On 06/19/2011 04:46 AM, Loren Merritt wrote:

> On Sat, 18 Jun 2011, Justin Ruggles wrote:
>> On 06/17/2011 09:44 PM, Loren Merritt wrote:
>>> On Thu, 16 Jun 2011, Justin Ruggles wrote:
>>>
>>>> Also, unrolling to 32 values per loop on x86-64 does help, so I'll send
>>>> an updated patch to do that.
>>>
>>> On Atom, you mean? Penryn is indifferent to amount of unrolling here.
>>> Can you unroll with %rep instead of copy/paste?
>>
>> Maybe we're thinking of different things.  I was referring to unrolling
>> by using more xmm registers for x86-64.  This helps on atom and sandy
>> bridge, but doesn't seem to have a significant effect on athlon64.  I
>> also don't see how I could do that cleanly with %rep.
> 
> I included both "interleaved copies" and "consecutive copies" in 
> "unrolling", under the prediction that they'd have the same effect. I was 
> wrong, but I still don't know why. 4x should be plenty to max out ILP 
> (regardless of whether it gets done by manual interleaving or instruction 
> reordering), so I don't see what 8x unroll could possibly do other than 
> have a tiny effect on loop overhead and increase code size.
> 
> That said, I can reproduce your result on sandybridge.

OK, I did some more thorough testing on Athlon64, Atom, and Sandy
Bridge.

Here are the numbers (lower is faster). All tests were done with the
same data using len=3072.

atom:
23564 - C
15120 - MMX
16051 - MMX  (unroll 8x interleaved)
20651 - SSE2 (float clipping)
20106 - SSE2 (float clipping - unroll 8x consecutive)
20220 - SSE2 (float clipping - unroll 8x interleaved)
 7714 - SSE2 (int clipping)
 7750 - SSE2 (int clipping - unroll 8x consecutive)
 8476 - SSE2 (int clipping - unroll 8x interleaved)

athlon64:
8877 - C
8907 - MMX
9230 - MMX  (unroll 8x interleaved)
6652 - SSE2 (float clipping)
7051 - SSE2 (float clipping - unroll 8x consecutive)
6731 - SSE2 (float clipping - unroll 8x interleaved)
8240 - SSE2 (int clipping)
8773 - SSE2 (int clipping - unroll 8x consecutive)
8294 - SSE2 (int clipping - unroll 8x interleaved)

sandybridge:
6904 - C
5513 - MMX
5653 - MMX  (unroll 8x interleaved)
2834 - SSE2 (float clipping)
2830 - SSE2 (float clipping - unroll 8x consecutive)
2868 - SSE2 (float clipping - unroll 8x interleaved)
2769 - SSE2 (int clipping)
2699 - SSE2 (int clipping - unroll 8x consecutive)
2833 - SSE2 (int clipping - unroll 8x interleaved)
 947 - SSE4.1
 907 - SSE4.1 (unroll 8x consecutive)
1094 - SSE4.1 (unroll 8x interleaved)

Conclusions:

- Clipping by doing int2float/clip/float2int only benefits Athlon64
(6652 vs. 8240). The integer-only version is much faster on Atom (7714
vs. 20651). Are there other CPUs that might benefit from the float
version?

- Unrolling 8x only significantly improves speed on Sandy Bridge, and
only when doing it by using more xmm registers. It might be worth
testing on Intel CPUs without SSE4.1 other than Atom.
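
For anyone following along, the two SSE2 clipping strategies compared
above can be sketched with C intrinsics roughly like this. It is an
illustration only, not the patch itself: the function names are made up,
len is assumed to be a multiple of 4, and the float path is only
guaranteed exact when min and max lie within +/-2^24, since a float has
a 24-bit significand.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE float ops) */
#include <stdint.h>

/* Integer-only clip. SSE2 has no packed signed 32-bit min/max (those
 * arrive with SSE4.1), so emulate them with a compare and a bitwise select. */
static inline __m128i clip4_int(__m128i v, __m128i lo, __m128i hi)
{
    __m128i gt = _mm_cmpgt_epi32(v, hi);          /* lanes where v > hi */
    v = _mm_or_si128(_mm_and_si128(gt, hi), _mm_andnot_si128(gt, v));
    __m128i lt = _mm_cmpgt_epi32(lo, v);          /* lanes where v < lo */
    return _mm_or_si128(_mm_and_si128(lt, lo), _mm_andnot_si128(lt, v));
}

/* Float clip: int -> float, min/max in float, float -> int. */
static inline __m128i clip4_float(__m128i v, __m128 lo, __m128 hi)
{
    __m128 f = _mm_cvtepi32_ps(v);
    f = _mm_min_ps(_mm_max_ps(f, lo), hi);
    return _mm_cvtps_epi32(f);
}

/* Hypothetical wrappers; len must be a multiple of 4. */
void clip_int32_int(int32_t *dst, const int32_t *src,
                    int32_t min, int32_t max, unsigned len)
{
    __m128i lo = _mm_set1_epi32(min), hi = _mm_set1_epi32(max);
    for (unsigned i = 0; i < len; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), clip4_int(v, lo, hi));
    }
}

void clip_int32_float(int32_t *dst, const int32_t *src,
                      int32_t min, int32_t max, unsigned len)
{
    __m128 lo = _mm_set1_ps((float)min), hi = _mm_set1_ps((float)max);
    for (unsigned i = 0; i < len; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), clip4_float(v, lo, hi));
    }
}
```

The integer path costs more instructions per vector than the float
path, but avoids the two conversions, which is presumably where the
per-CPU differences come from.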

At this point I'm inclined to do:
- unroll 8x only for SSE4.1 by using more xmm registers
- have a separate version for SSE2SLOW (i.e. Athlon64) that uses the
int->float conversion.
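
In terms of selection order, that plan would look something like the
sketch below. The flag and enum names are hypothetical placeholders, not
the project's actual CPU-flag API, and it assumes SSE2SLOW is reported
as its own flag.

```c
/* Hypothetical CPU capability bits, for illustration only. */
enum {
    CPU_SSE2     = 1 << 0,
    CPU_SSE2SLOW = 1 << 1,  /* SSE2 present but slow (e.g. Athlon64) */
    CPU_SSE4     = 1 << 2
};

typedef enum {
    IMPL_C,             /* plain C fallback */
    IMPL_SSE2_INT,      /* SSE2, integer-only clipping */
    IMPL_SSE2_FLOAT,    /* SSE2, int->float->int clipping */
    IMPL_SSE4_UNROLL8   /* SSE4.1, unrolled 8x */
} clip_impl;

/* Pick the most specific version the CPU supports. */
clip_impl select_clip_impl(unsigned flags)
{
    if (flags & CPU_SSE4)
        return IMPL_SSE4_UNROLL8;
    if (flags & CPU_SSE2SLOW)
        return IMPL_SSE2_FLOAT;
    if (flags & CPU_SSE2)
        return IMPL_SSE2_INT;
    return IMPL_C;
}
```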

Thoughts?

Thanks,
Justin