On 06/17/2011 09:44 PM, Loren Merritt wrote:
> On Thu, 16 Jun 2011, Justin Ruggles wrote:
>
>> On 06/12/2011 04:31 PM, Ronald S. Bultje wrote:
>>
>>> Hi,
>>>
>>> On Sat, Jun 11, 2011 at 10:35 AM, Justin Ruggles
>>> <[email protected]> wrote:
>>>> ---
>>>> libavcodec/dsputil.c | 17 +++++++
>>>> libavcodec/dsputil.h | 14 ++++++
>>>> libavcodec/x86/dsputil_mmx.c | 15 +++++++
>>>> libavcodec/x86/dsputil_yasm.asm | 88
>>>> +++++++++++++++++++++++++++++++++++++++
>>>> 4 files changed, 134 insertions(+), 0 deletions(-)
>>> [..]
>>>> + CLIPD m0, m4, m5, m6
>>>> + CLIPD m1, m4, m5, m6
>>>> + CLIPD m2, m4, m5, m6
>>>> + CLIPD m3, m4, m5, m6
>>>
>>> For something like Atom (or basically anything with out-of-order
>>> execution), this could be interleaved (i.e. CLIPDx2 m0, m1, m4, m5,
>>> m6). With that changed, looks good to me, feel free to apply.
>>
>>
>> I tested that on Atom and it doesn't improve speed. But it doesn't hurt
>> speed either. Should we do it anyway?
>>
>> Also, unrolling to 32 values per loop on x86-64 does help, so I'll send
>> an updated patch to do that.
>
> On Atom, you mean? Penryn is indifferent to amount of unrolling here.
> Can you unroll with %rep instead of copy/paste?

Maybe we're thinking of different things. I was referring to unrolling
by using more xmm registers on x86-64. This helps on Atom and Sandy
Bridge, but doesn't seem to have a significant effect on Athlon 64. I
also don't see how I could do that cleanly with %rep.

The other option, I guess, would be running the load/clip/store twice
before looping, which can of course be done simply with %rep. That
actually does seem to improve speed slightly on Athlon 64, but I haven't
tested it on other systems yet.

> Also, document the limitations on min/max values due to the float
> implementation.

Indeed. It's accurate for +/- 1<<24, right? Or is it 1<<25?

-Justin
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
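
For reference on the 1<<24 vs. 1<<25 question, here is a minimal scalar
sketch in C. It is not the code from the patch: clip_int32_via_float()
and roundtrips_exactly() are hypothetical helpers that only model a
"convert to float, clamp, convert back" clip. It relies on one known
fact: an IEEE 754 single-precision float has a 24-bit significand, so
every integer with magnitude up to 1<<24 is exactly representable, and
(1<<24)+1 is the first one that is not.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if v survives an int32 -> float -> integer
 * round trip unchanged. The volatile store forces real single precision,
 * and the comparison is done in 64 bits so the conversion back cannot
 * overflow. */
static int roundtrips_exactly(int32_t v)
{
    volatile float f = (float)v;
    return (int64_t)f == (int64_t)v;
}

/* Hypothetical scalar model of a float-based clip. It assumes min <= max
 * and that both bounds are far enough from the int32 limits for the final
 * conversion back to int32_t to be well defined. */
static int32_t clip_int32_via_float(int32_t v, int32_t min, int32_t max)
{
    volatile float f    = (float)v;
    volatile float fmin = (float)min;
    volatile float fmax = (float)max;

    if (f < fmin)
        f = fmin;
    if (f > fmax)
        f = fmax;
    return (int32_t)f;
}
```

With bounds inside +/- 1<<24, every value that can survive the clamp is
exactly representable, so this model agrees with an integer clip. With a
bound such as (1<<24)+1, the bound itself is rounded on conversion, so
the clipped result can be off by one.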
