On Sat, 18 Jun 2011, Justin Ruggles wrote:
On 06/17/2011 09:44 PM, Loren Merritt wrote:
On Thu, 16 Jun 2011, Justin Ruggles wrote:

Also, unrolling to 32 values per loop on x86-64 does help, so I'll send
an updated patch to do that.

On Atom, you mean? Penryn is indifferent to amount of unrolling here.
Can you unroll with %rep instead of copy/paste?

Maybe we're thinking of different things.  I was referring to unrolling
by using more xmm registers for x86-64.  This helps on Atom and Sandy
Bridge, but doesn't seem to have a significant effect on Athlon 64.  I
also don't see how I could do that cleanly with %rep.

I included both "interleaved copies" and "consecutive copies" in
"unrolling", under the prediction that they'd have the same effect. I
was wrong, but I still don't know why. 4x should be plenty to max out
ILP (regardless of whether it gets done by manual interleaving or
instruction reordering), so I don't see what 8x unroll could possibly do
other than have a tiny effect on loop overhead and increase code size.

That said, I can reproduce your result on Sandy Bridge.

Also, document the limitations on min/max values due to the float
implementation.

Indeed. It's accurate for +/- 1<<24, right?

yes
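
For reference, the 1<<24 bound follows from single-precision floats
having a 24-bit significand: every integer with magnitude up to 1<<24 is
exactly representable, and (1<<24)+1 is the first one that is not. A
minimal C sketch of that limit follows. The helper names are
hypothetical and are not taken from the patch; it only round-trips a
value through float, which is assumed to be what a float-based
implementation effectively does to its min/max arguments.

```c
#include <stdint.h>

/* Hypothetical helper: convert an int32 to float and back again. */
int32_t roundtrip_via_float(int32_t v)
{
    /* volatile forces the value to be stored as a true 32-bit float,
     * so excess precision in wider registers cannot hide the rounding. */
    volatile float f = (float)v;
    return (int32_t)f;
}

/* Hypothetical helper: returns 1 if v is unchanged by the round trip. */
int survives_float(int32_t v)
{
    return roundtrip_via_float(v) == v;
}
```

With these, survives_float(1 << 24) and survives_float(-(1 << 24)) hold,
while survives_float((1 << 24) + 1) does not: the value is rounded to
the nearest representable float, 1<<24.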

--Loren Merritt
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel