On Sat, 18 Jun 2011, Justin Ruggles wrote:
On 06/17/2011 09:44 PM, Loren Merritt wrote:
On Thu, 16 Jun 2011, Justin Ruggles wrote:
Also, unrolling to 32 values per loop on x86-64 does help, so I'll send
an updated patch to do that.
On Atom, you mean? Penryn is indifferent to the amount of unrolling here.
Can you unroll with %rep instead of copy/paste?
Maybe we're thinking of different things. I was referring to unrolling
by using more xmm registers for x86-64. This helps on Atom and Sandy
Bridge, but doesn't seem to have a significant effect on Athlon 64. I
also don't see how I could do that cleanly with %rep.
I included both "interleaved copies" and "consecutive copies" in
"unrolling", under the prediction that they'd have the same effect. I was
wrong, but I still don't know why. 4x should be plenty to max out ILP
(regardless of whether it gets done by manual interleaving or instruction
reordering), so I don't see what 8x unroll could possibly do other than
have a tiny effect on loop overhead and increase code size.
That said, I can reproduce your result on Sandy Bridge.
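
To make the two meanings of "unrolling" concrete, here is a rough C sketch rather than the actual asm from the patch. The function names and the scalar clip operation are made up for illustration. "Consecutive" is what repeating the loop body (e.g. with %rep) would give; "interleaved" keeps four independent values live at once, which is the variant that needs more registers. A C compiler is free to reorder either form, so this shows only the instruction order one would write by hand in asm, not a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "Consecutive copies": the same load/clip/store sequence repeated
 * four times, each copy finishing before the next one starts.
 * Only one temporary is live at a time. (Hypothetical example.) */
void clip_consecutive(int32_t *dst, const int32_t *src,
                      int32_t lo, int32_t hi, size_t len)
{
    assert(lo <= hi && len % 4 == 0);
    for (size_t i = 0; i < len; i += 4) {
        int32_t t;
        t = src[i + 0]; t = t < lo ? lo : t; t = t > hi ? hi : t; dst[i + 0] = t;
        t = src[i + 1]; t = t < lo ? lo : t; t = t > hi ? hi : t; dst[i + 1] = t;
        t = src[i + 2]; t = t < lo ? lo : t; t = t > hi ? hi : t; dst[i + 2] = t;
        t = src[i + 3]; t = t < lo ? lo : t; t = t > hi ? hi : t; dst[i + 3] = t;
    }
}

/* "Interleaved copies": all loads, then all lower clamps, then all
 * upper clamps, then all stores. Four temporaries are live at once,
 * mirroring the use of more xmm registers. (Hypothetical example.) */
void clip_interleaved(int32_t *dst, const int32_t *src,
                      int32_t lo, int32_t hi, size_t len)
{
    assert(lo <= hi && len % 4 == 0);
    for (size_t i = 0; i < len; i += 4) {
        int32_t a = src[i + 0], b = src[i + 1];
        int32_t c = src[i + 2], d = src[i + 3];
        a = a < lo ? lo : a;  b = b < lo ? lo : b;
        c = c < lo ? lo : c;  d = d < lo ? lo : d;
        a = a > hi ? hi : a;  b = b > hi ? hi : b;
        c = c > hi ? hi : c;  d = d > hi ? hi : d;
        dst[i + 0] = a;  dst[i + 1] = b;
        dst[i + 2] = c;  dst[i + 3] = d;
    }
}
```

Both functions compute the same result; the only difference is the order in which the independent operations are written.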
Also, document the limitations on min/max values due to the float
implementation.
Indeed. It's accurate for +/- 1<<24, right?
Yes.
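
For the documentation, a small C check of where that limit comes from may help. It assumes the float implementation converts int32 values to single-precision floats, which have a 24-bit significand; the helper name is made up for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an int32 to a 32-bit float and back (hypothetical helper).
 * The input is restricted so the conversion back to int32 cannot
 * overflow. The volatile forces the value to be stored as a real
 * single-precision float rather than kept in a wider register. */
int32_t roundtrip_float(int32_t v)
{
    assert(v >= -(1 << 30) && v <= (1 << 30));
    volatile float f = (float)v;
    return (int32_t)f;
}

/* Returns 1 if v survives the int32 -> float -> int32 round trip. */
int exact_in_float(int32_t v)
{
    return roundtrip_float(v) == v;
}
```

Every integer with magnitude up to 1<<24 round-trips exactly. (1<<24)+1 is the first one that does not: it rounds to 1<<24, so min/max values beyond that range may be off by a small amount.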
--Loren Merritt