http://d.puremagic.com/issues/show_bug.cgi?id=4393
Don <clugd...@yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |clugd...@yahoo.com.au

--- Comment #4 from Don <clugd...@yahoo.com.au> 2011-04-28 21:36:38 PDT ---
Did you test this on Intel or on AMD? BLAS1 code is generally limited by
memory access, and AMD has two load ports, so it has different bottlenecks
and always does better than Intel in these operations.

See also the discussion at:
http://www.bealto.com/mp-dot_sse.html
(I've talked to Eric Bealto before; he's happy for Phobos to use any of his
stuff if we see anything we want.)

It's a bit misleading, though: above a certain length you become dominated
by cache effects, so I don't know whether unrolling by 4 is actually
worthwhile in practice.

I also have some optimized x87 dot product code (32-bit AMD CPUs don't have
SSE2, so they still need x87 for doubles).
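
For readers unfamiliar with the term, here is a minimal sketch in C of what "unrolling by 4" means for a dot product. It is not the code from this issue or from the linked page; the function names `dot_simple` and `dot_unrolled4` are hypothetical and chosen only for illustration. The unrolled loop keeps four independent accumulators, so consecutive multiply-adds do not wait on each other. Whether that helps in practice depends on the CPU and, as noted above, on whether the arrays still fit in cache.

```c
#include <stddef.h>

/* Reference version: one multiply-add per iteration, one accumulator. */
double dot_simple(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by 4 (illustrative sketch, hypothetical name).
 * Four independent partial sums break the dependency chain on a single
 * accumulator; a scalar tail loop handles the last n % 4 elements. */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: process four element pairs per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then finish the remaining elements. */
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Because the unrolled version adds the products in a different order, its result can differ from `dot_simple` in the last bits for general floating-point inputs. To compare the two meaningfully, time them at several array lengths, both well below and well above the cache sizes of the machine under test.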