On Sun, 19 Jun 2011, Justin Ruggles wrote:
On 06/19/2011 04:46 AM, Loren Merritt wrote:
I included both "interleaved copies" and "consecutive copies" in
"unrolling", under the prediction that they'd have the same effect. I was
wrong, but I still don't know why. 4x should be plenty to max out ILP
(regardless of whether it gets done by manual interleaving or instruction
reordering), so I don't see what 8x unroll could possibly do other than
have a tiny effect on loop overhead and increase code size.
That said, I can reproduce your result on sandybridge.
sandybridge:
[...]
947 - SSE4.1
907 - SSE4.1 (unroll 8x consecutive)
1094 - SSE4.1 (unroll 8x interleaved)
Did you get that backwards? Interleaved is the one that uses more xmmregs,
consecutive is the one that can be done with %rep.
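
To pin down the terminology, here is a minimal scalar C sketch of the two unrolling styles. It is only an illustration: the function and helper names are hypothetical, each scalar temporary stands in for one xmm register, and len is assumed to be a multiple of 4. It is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; each stands in for one SIMD min/max instruction. */
static inline int32_t min_i32(int32_t a, int32_t b) { return a < b ? a : b; }
static inline int32_t max_i32(int32_t a, int32_t b) { return a > b ? a : b; }

/* "Consecutive" 4x unroll: the same load/clip/store body is repeated back
 * to back, and every copy reuses the single temporary t. In asm this is
 * what a %rep around the loop body would produce. */
void clip_consecutive4(int32_t *dst, const int32_t *src,
                       int32_t lo, int32_t hi, size_t len)
{
    for (size_t i = 0; i < len; i += 4) { /* assumes len % 4 == 0 */
        int32_t t;
        t = src[i + 0]; t = max_i32(t, lo); t = min_i32(t, hi); dst[i + 0] = t;
        t = src[i + 1]; t = max_i32(t, lo); t = min_i32(t, hi); dst[i + 1] = t;
        t = src[i + 2]; t = max_i32(t, lo); t = min_i32(t, hi); dst[i + 2] = t;
        t = src[i + 3]; t = max_i32(t, lo); t = min_i32(t, hi); dst[i + 3] = t;
    }
}

/* "Interleaved" 4x unroll: each stage is done for all four elements before
 * the next stage starts, so four temporaries are live at once. */
void clip_interleaved4(int32_t *dst, const int32_t *src,
                       int32_t lo, int32_t hi, size_t len)
{
    for (size_t i = 0; i < len; i += 4) { /* assumes len % 4 == 0 */
        int32_t t0 = src[i + 0], t1 = src[i + 1];
        int32_t t2 = src[i + 2], t3 = src[i + 3];
        t0 = max_i32(t0, lo); t1 = max_i32(t1, lo);
        t2 = max_i32(t2, lo); t3 = max_i32(t3, lo);
        t0 = min_i32(t0, hi); t1 = min_i32(t1, hi);
        t2 = min_i32(t2, hi); t3 = min_i32(t3, hi);
        dst[i + 0] = t0; dst[i + 1] = t1;
        dst[i + 2] = t2; dst[i + 3] = t3;
    }
}
```

Both functions compute the same result; they differ only in instruction order and in how many temporaries are live at a time. A C compiler is free to reschedule either form, so the distinction only matters for hand-written asm.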
- Clipping by doing int2float/clip/float2int only benefits athlon64. The
integer-only version is insanely faster on Atom. Are there other CPUs
that might benefit from the float version?
conroe:
6252 mmx
2855 sse2 float, 4x
2844 sse2 float, 8x interleaved
2804 sse2 float, 8x consecutive
3230 sse2 int, 4x
3044 sse2 int, 8x interleaved
3200 sse2 int, 8x consecutive
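
For reference, the "int" and "float" variants compared above can be sketched in scalar C as follows. This is a rough illustration under stated assumptions, not the benchmarked code: the function names are hypothetical, and the float path is only exact when both clip bounds fit in 24 bits, since a float cannot represent every larger int32 value.

```c
#include <stdint.h>

/* Integer-only clip: compare and select directly on the int32 value. */
int32_t clip_int(int32_t x, int32_t lo, int32_t hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* int2float / clip / float2int.
 * Assumption: lo and hi both fit in 24 bits, so (float)lo and (float)hi are
 * exact. Inputs outside that range may round on conversion, but they are
 * clamped to a bound anyway, so the result still matches clip_int(). */
int32_t clip_via_float(int32_t x, int32_t lo, int32_t hi)
{
    float f   = (float)x;
    float flo = (float)lo;
    float fhi = (float)hi;
    if (f < flo) f = flo;
    if (f > fhi) f = fhi;
    return (int32_t)f;
}
```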
--Loren Merritt