On 06/20/2011 07:08 AM, Loren Merritt wrote:
> On Sun, 19 Jun 2011, Justin Ruggles wrote:
>> On 06/19/2011 04:46 AM, Loren Merritt wrote:
>>
>>> I included both "interleaved copies" and "consecutive copies" in
>>> "unrolling", under the prediction that they'd have the same effect. I was
>>> wrong, but I still don't know why. 4x should be plenty to max out ILP
>>> (regardless of whether it gets done by manual interleaving or instruction
>>> reordering), so I don't see what 8x unroll could possibly do other than
>>> have a tiny effect on loop overhead and increase code size.
>>>
>>> That said, I can reproduce your result on sandybridge.
>>
>> sandybridge:
> [...]
>> 947 - SSE4.1
>> 907 - SSE4.1 (unroll 8x consecutive)
>> 1094 - SSE4.1 (unroll 8x interleaved)
>
> Did you get that backwards? Interleaved is the one that uses more xmmregs,
> consecutive is the one that can be done with %rep.

Oh, yeah I did then. read/process/write/read/process/write/loop just
sounded more "interleaved" to me...

>> - Clipping by doing int2float/clip/float2int only benefits athlon64. The
>> integer-only version is insanely faster on Atom. Are there other CPUs
>> that might benefit from the float version?
>
> conroe:
> 6252 mmx
> 2855 sse2 float, 4x
> 2844 sse2 float, 8x interleaved
> 2804 sse2 float, 8x consecutive
> 3230 sse2 int, 4x
> 3044 sse2 int, 8x interleaved
> 3200 sse2 int, 8x consecutive

Well darn. This makes the decision more complicated. How about:

sse2 float, 8x consecutive - default
sse2 float, 4x - sse2slow (athlon64)
sse2 int, 4x - atom
sse41 8x interleaved

Alternatively, we could leave out the float 4x version and it will be
about 1% slower on athlon64.

-Justin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
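
For anyone reading the thread later, here is a rough C sketch of the two unrolling styles being compared, since the naming got swapped above. It is only an illustration, not the asm under review: the function names (clip1, clip_consecutive4, clip_interleaved4) are made up, plain int32_t temporaries stand in for xmm registers, and len is assumed to be a multiple of 4. "Consecutive" repeats the whole read/clip/write body back to back, which is what %rep gives you, so one temporary is enough. "Interleaved" groups all loads, then all clips, then all stores, so it needs one live temporary per copy.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a single value to the range [lo, hi]. */
static inline int32_t clip1(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* "Consecutive" 4x unroll: the read/clip/write body is repeated back to
 * back (the %rep style). A single temporary is reused by every copy.
 * Assumption: len is a multiple of 4. */
static void clip_consecutive4(int32_t *dst, const int32_t *src,
                              int32_t lo, int32_t hi, size_t len)
{
    for (size_t i = 0; i < len; i += 4) {
        int32_t t;
        t = src[i + 0]; t = clip1(t, lo, hi); dst[i + 0] = t;
        t = src[i + 1]; t = clip1(t, lo, hi); dst[i + 1] = t;
        t = src[i + 2]; t = clip1(t, lo, hi); dst[i + 2] = t;
        t = src[i + 3]; t = clip1(t, lo, hi); dst[i + 3] = t;
    }
}

/* "Interleaved" 4x unroll: all loads, then all clips, then all stores.
 * Four temporaries are live at once, mirroring the extra xmm registers
 * the interleaved asm version needs.
 * Assumption: len is a multiple of 4. */
static void clip_interleaved4(int32_t *dst, const int32_t *src,
                              int32_t lo, int32_t hi, size_t len)
{
    for (size_t i = 0; i < len; i += 4) {
        int32_t t0 = src[i + 0];
        int32_t t1 = src[i + 1];
        int32_t t2 = src[i + 2];
        int32_t t3 = src[i + 3];

        t0 = clip1(t0, lo, hi);
        t1 = clip1(t1, lo, hi);
        t2 = clip1(t2, lo, hi);
        t3 = clip1(t3, lo, hi);

        dst[i + 0] = t0;
        dst[i + 1] = t1;
        dst[i + 2] = t2;
        dst[i + 3] = t3;
    }
}
```

Both functions compute the same result; they differ only in instruction order and register pressure, which is what the timings above are measuring. A C compiler is free to reschedule either form, so this shows the layout only and says nothing about the measured cycle counts.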
