On Sat, Mar 22, 2008 at 12:54 PM, Anne Archibald <[EMAIL PROTECTED]> wrote:
> On 22/03/2008, Travis E. Oliphant <[EMAIL PROTECTED]> wrote:
> > James Philbin wrote:
> > > Personally, I think that the time would be better spent optimizing
> > > routines for single-threaded code and relying on BLAS and LAPACK
> > > libraries to use multiple cores for more complex calculations. In
> > > particular, doing some basic loop unrolling and SSE versions of the
> > > ufuncs would be beneficial. I have some experience writing SSE code
> > > using intrinsics and would be happy to give it a shot if people tell
> > > me what functions I should focus on.
> >
> > Fabulous! This is on my Project List of todo items for NumPy. See
> > http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend
> > some time refactoring the ufunc loops so that the templating does not
> > get in the way of doing this on a case by case basis.
> >
> > 1) You should focus on the math operations: add, subtract, multiply,
> > divide, and so forth.
> > 2) Then for "combined operations" we should expose the functionality at
> > a high-level. So, that somebody could write code to take advantage of
> > it.
> >
> > It would be easiest to use intrinsics which would then work for AMD,
> > Intel, on multiple compilers.
>
> I think even heavier use of code generation would be a good idea here.
> There are so many different versions of each loop, and the fastest way
> to run each one is going to be different for different versions and
> different platforms, that a routine that assembled the code from
> chunks and picked the fastest combination for each instance might make
> a big difference - this is roughly what FFTW and ATLAS do.
>

Maybe it's time to revisit the template subsystem I pulled out of Django.

Chuck
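
A rough sketch of the "pick the fastest combination" idea quoted above, in
pure Python so it runs anywhere. Everything here is an assumption made for
illustration: the names add_simple, add_unrolled4 and pick_fastest are
hypothetical, the loops stand in for generated C variants of a ufunc inner
loop, and this is not how FFTW, ATLAS or NumPy actually implement it.

```python
import timeit


def add_simple(a, b, out):
    """Baseline variant: one element per iteration."""
    for i in range(len(a)):
        out[i] = a[i] + b[i]


def add_unrolled4(a, b, out):
    """Variant unrolled by four, with a scalar loop for the remainder."""
    n = len(a)
    main = n - n % 4
    for i in range(0, main, 4):
        out[i] = a[i] + b[i]
        out[i + 1] = a[i + 1] + b[i + 1]
        out[i + 2] = a[i + 2] + b[i + 2]
        out[i + 3] = a[i + 3] + b[i + 3]
    for i in range(main, n):
        out[i] = a[i] + b[i]


def pick_fastest(candidates, a, b, repeats=3):
    """Time every candidate on the given inputs and return the quickest.

    Returns (best_function, {function_name: best_time_in_seconds}).
    """
    out = [0.0] * len(a)
    timings = {}
    for func in candidates:
        runs = timeit.repeat(lambda: func(a, b, out), number=1, repeat=repeats)
        timings[func.__name__] = min(runs)
    best = min(candidates, key=lambda f: timings[f.__name__])
    return best, timings


if __name__ == "__main__":
    a = [float(i) for i in range(100000)]
    b = [1.0] * len(a)
    best, timings = pick_fastest([add_simple, add_unrolled4], a, b)
    print("fastest variant on this machine:", best.__name__)
    print(timings)
```

Which variant wins depends on the machine and the input size, which is the
point of measuring at build or install time rather than hard-coding a choice.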
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion