The tiled implementation appeared in PDL-2.007,
so if you're using an older PDL version, you'll get
the extra slowdown for free!  :-)
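For anyone who wants to try the benchmark themselves: in PDL, matrix
multiplication is the overloaded `x` operator (PDL::Primitive's matmult).
A minimal sketch -- the sizes here are placeholders, not the ones
discussed in the thread:

```perl
use PDL;

# Placeholder dimensions; the matrices in the thread were much larger.
my $a = random(1000, 1000);
my $b = random(1000, 1000);

# `x` is PDL's matrix-multiply operator; in PDL >= 2.007 it uses
# the tile-based algorithm discussed in this thread.
my $c = $a x $b;

print $c->info, "\n";   # dimensions and type of the result
```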

On Sun, Sep 7, 2014 at 8:45 AM, Chris Marshall <devel.chm...@gmail.com> wrote:
> Performance was also helped by the fact that the original
> inner-product matmult algorithm was replaced by a tile-based
> one, which leads to better cache reuse.
>
> ...and you have Craig to thank for the implementation.  :-)
>
> --Chris
>
>
> On Sun, Sep 7, 2014 at 7:03 AM, David Mertens <dcmertens.p...@gmail.com> 
> wrote:
>> It's also quite likely that Intel CPUs have heuristics to prefetch data,
>> and they're helping out. Maybe.
>>
>> David
>>
>>
>> On Sat, Sep 6, 2014 at 10:28 PM, Craig DeForest <defor...@boulder.swri.edu>
>> wrote:
>>>
>>> Cool!  Maybe there are more cache hits than I expected (which was none)...
>>>
>>> (Mobile)
>>>
>>>
>>> > On Sep 6, 2014, at 1:54 PM, Chris Marshall <devel.chm...@gmail.com>
>>> > wrote:
>>> >
>>> > On my PC (2.8 GHz i7) it takes about an hour for the multiply
>>> > just using $a x $b as Craig shows.  I haven't tried using the
>>> > autothreading support to see how that changes things.
>>> >
>>> > As discussed already, GPU acceleration could allow for much
>>> > faster computation.  For a start, Nvidia's cuBLAS library
>>> > implements matrix multiplication and could be used to
>>> > optimize performance.
>>> >
>>> > --Chris

_______________________________________________
Perldl mailing list
Perldl@jach.hawaii.edu
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
