On Sun, Nov 23, 2008 at 8:29 AM, bearophile <[EMAIL PROTECTED]> wrote:
> Andrei Alexandrescu:
>> My guess is that if you turn that off, the differences won't be as large
>> (or even detectable for certain ranges of N).
>
> Array bounds aren't checked: the code is compiled with -O -release
> -inline.
> Do you see any array bounds checks in the asm code at the bottom of my post?
>
>
>> Probably blocking will bring even more mileage (but again that depends
>> on N).
>
> Yes, blocking may help. And using SSE instructions may help some more. The
> end result may be a hundred or more times faster than the naive code in D :-)

This is why I prefer to call an optimized BLAS lib for all my large
matrix multiplication needs.  All that nonsense is already taken care
of.  And it's compiled with GCC, which has better floating-point
optimization to begin with.  I haven't done any benchmarks, though. :-)
Might be interesting to try my MinGW-compiled ATLAS BLAS matrix
multiply against the numbers you're getting there.
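
For anyone wondering what the "blocking" mentioned above amounts to, here is a
rough sketch in C (not D, and not taken from ATLAS or from bearophile's code).
The function names, the row-major square layout, and the BLOCK size of 32 are
all assumptions made for illustration. A tuned BLAS does a good deal more than
this (SSE kernels, packing, per-machine tuning).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tile edge length. 32 is an assumed value, not a tuned one; the best
   choice depends on the cache sizes of the machine. */
enum { BLOCK = 32 };

/* Naive triple loop: c = a * b, all n x n, stored row-major. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked (tiled) version: same result, but the work is split into
   BLOCK x BLOCK tiles so the pieces of a, b and c being touched are
   more likely to stay in cache while they are reused. */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    assert(a != c && b != c);          /* output must not alias the inputs */
    memset(c, 0, n * n * sizeof *c);   /* c is accumulated into, so clear it */

    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                /* Clamp tile ends so n need not be a multiple of BLOCK. */
                size_t i_end = min_size(ii + BLOCK, n);
                size_t k_end = min_size(kk + BLOCK, n);
                size_t j_end = min_size(jj + BLOCK, n);

                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        /* Innermost loop walks rows of b and c contiguously. */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

Both functions take the matrix size and three n*n arrays of doubles, so they
can be timed against each other for different N. How much the blocked one
wins (if at all) depends on N, the compiler flags, and the cache.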

--bb
