Charles R Harris wrote:
> On Sat, Mar 22, 2008 at 11:43 AM, Neal Becker <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
> > James Philbin wrote:
> > > Personally, I think that the time would be better spent optimizing
> > > routines for single-threaded code and relying on BLAS and LAPACK
> > > libraries to use multiple cores for more complex calculations. In
> > > particular, doing some basic loop unrolling and SSE versions of the
> > > ufuncs would be beneficial. I have some experience writing SSE code
> > > using intrinsics and would be happy to give it a shot if people tell
> > > me what functions I should focus on.
> > >
> > > James
> >
> > gcc keeps advancing autovectorization. Is manual vectorization worth
> > the trouble?
>
> The inner loop of a unary ufunc looks like
>
>     /*UFUNC_API*/
>     static void
>     PyUFunc_d_d(char **args, intp *dimensions, intp *steps, void *func)
>     {
>         intp i;
>         char *ip1 = args[0], *op = args[1];
>
>         for (i = 0; i < *dimensions; i++, ip1 += steps[0], op += steps[1]) {
>             *(double *)op = ((DoubleUnaryFunc *)func)(*(double *)ip1);
>         }
>     }
>
> While it might help the compiler to put the steps on the stack as
> constants, it is hard to see how the compiler could vectorize the loop
> given the information available and the fact that the input data might
> not be aligned or contiguous. I suppose one could make a small local
> buffer, copy the data into it, and then use SSE, and that might
> actually help for some things. But it is also likely that the function
> itself won't deal gracefully with vectorized data.
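As a concrete illustration of what the compiler needs to see, here is a minimal
standalone sketch (not NumPy code; the function name and the squaring operation
are made up for the example) of a contiguous, unit-stride d->d loop. With the
element type, the unit stride, and non-aliasing restrict-qualified pointers all
visible, gcc can autovectorize this at -O3 (which enables -ftree-vectorize).
Note that this only works because the operation is written inline; the generic
loop above calls through a function pointer per element, which would defeat the
autovectorizer even on contiguous data, echoing the caveat at the end of the
quoted message.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical contiguous d->d specialization (not NumPy code).
     * Unit stride plus restrict-qualified (non-aliasing) pointers give
     * gcc enough information to autovectorize this at -O3; the body is
     * written inline, since a per-element call through a function
     * pointer would block vectorization anyway. */
    static void
    square_d_d_contig(const double *restrict in, double *restrict out,
                      size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            out[i] = in[i] * in[i];
        }
    }

    int main(void)
    {
        double in[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        double out[8];

        square_d_d_contig(in, out, 8);
        printf("%g %g\n", out[1], out[7]);   /* prints: 1 49 */
        return 0;
    }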
I think the thing to do is to special-case the code so that if the strides
work for vectorization, then a different bit of code is executed, and this
current code is used as the final fall-back case. Something like this would
be relatively straightforward, if a bit tedious, to do.

-Travis
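In case it is useful as a starting point, below is a rough standalone sketch of
that kind of dispatch (illustrative only; the names, the squaring operation, the
SSE2 intrinsics, and the 16-byte alignment test are assumptions for the example,
not NumPy's actual loop code). It checks whether both steps equal
sizeof(double) and the pointers are 16-byte aligned, takes an SSE2 path in that
case, and otherwise falls through to a strided loop with the same shape as
PyUFunc_d_d. Build with something like gcc -std=c99 -O2 -msse2.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef ptrdiff_t npy_intp_t;   /* stand-in for NumPy's intp */

    /* Generic strided fall-back with the same shape as PyUFunc_d_d,
     * but with the operation (squaring) inlined for the example. */
    static void
    square_strided(char *ip, char *op, npy_intp_t n,
                   npy_intp_t istep, npy_intp_t ostep)
    {
        npy_intp_t i;
        for (i = 0; i < n; i++, ip += istep, op += ostep) {
            double x = *(double *)ip;
            *(double *)op = x * x;
        }
    }

    /* SSE2 path for unit-stride, 16-byte-aligned data:
     * two doubles per iteration, plus a scalar tail. */
    static void
    square_sse2(const double *in, double *out, npy_intp_t n)
    {
        npy_intp_t i = 0;
        for (; i + 2 <= n; i += 2) {
            __m128d v = _mm_load_pd(in + i);
            _mm_store_pd(out + i, _mm_mul_pd(v, v));
        }
        for (; i < n; i++) {
            out[i] = in[i] * in[i];
        }
    }

    /* Dispatcher: if the strides (and alignment) permit vectorization,
     * take the fast path; otherwise fall back to the strided loop. */
    static void
    square_dispatch(char **args, npy_intp_t *dimensions, npy_intp_t *steps)
    {
        char *ip = args[0], *op = args[1];
        int unit_stride = (steps[0] == (npy_intp_t)sizeof(double) &&
                           steps[1] == (npy_intp_t)sizeof(double));
        int aligned = (((uintptr_t)ip | (uintptr_t)op) % 16 == 0);

        if (unit_stride && aligned) {
            square_sse2((const double *)ip, (double *)op, dimensions[0]);
        } else {
            square_strided(ip, op, dimensions[0], steps[0], steps[1]);
        }
    }

    int main(void)
    {
        /* gcc-specific alignment attribute, to guarantee the SSE2 path. */
        static double in[6] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6};
        static double out[6] __attribute__((aligned(16)));
        char *args[2] = {(char *)in, (char *)out};
        npy_intp_t dims[1] = {6};
        npy_intp_t steps[2] = {sizeof(double), sizeof(double)};

        square_dispatch(args, dims, steps);
        printf("%g %g\n", out[0], out[5]);   /* prints: 1 36 */
        return 0;
    }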
