On 22.03.2008 at 19:20, Travis E. Oliphant wrote:
>I think the thing to do is to special-case the code so that if the
>strides work for vectorization, then a different bit of code is executed
>and this current code is used as the final special-case.
>Something like this would be relatively straightforward, if a bit
>tedious, to do.

I've experimented with branching the ufuncs into separate cases for constant strides and for aligned/unaligned data, so that SSE could be used via compiler intrinsics. I expected a considerable gain, as I was using float32 with stride 1 most of the time. However, profiling revealed that hardly anything was gained, because of

1) non-alignment of the vectors (this _could_ be handled by shuffled loading of the values, though)
2) the fact that my application used relatively large vectors that wouldn't fit into the CPU cache, so memory transfer became the bottleneck rather than the CPU.

I found the latter to be a real showstopper for most of my experiments with SIMD. It's especially a problem for numpy because smaller vectors have a lot of Python/numpy overhead, and larger ones don't really benefit due to cache exhaustion.

I'm curious whether OpenMP gives better results, as multi-cores often share their caches.

greetings,
Thomas
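
For illustration, the branching discussed in this message (special-casing on stride and alignment, with the generic strided loop as the final fallback) could look roughly like the sketch below. This is not the code from the experiment described above. The function name add_f32 and its signature are hypothetical, byte strides are assumed as in numpy's inner loops, and every pointer is assumed to be at least float-aligned. The SSE path is compiled only where the compiler defines __SSE__, and the unaligned case uses unaligned loads rather than the shuffled loading mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_add_ps, ... */
#endif

/* Hypothetical inner loop computing out = a + b on float32 data.
 * Strides are given in bytes; n is the number of elements. */
void add_f32(const char *a, ptrdiff_t sa,
             const char *b, ptrdiff_t sb,
             char *out, ptrdiff_t so, size_t n)
{
    const ptrdiff_t fsz = (ptrdiff_t)sizeof(float);
    size_t i = 0;

#if defined(__SSE__)
    /* Special case: all operands contiguous (stride of one element). */
    if (sa == fsz && sb == fsz && so == fsz) {
        const float *fa = (const float *)a;
        const float *fb = (const float *)b;
        float *fo = (float *)out;

        /* True only if all three pointers sit on a 16-byte boundary. */
        int aligned =
            (((uintptr_t)fa | (uintptr_t)fb | (uintptr_t)fo) & 15u) == 0;

        if (aligned) {
            /* Aligned loads and stores, four floats per iteration. */
            for (; i + 4 <= n; i += 4)
                _mm_store_ps(fo + i,
                             _mm_add_ps(_mm_load_ps(fa + i),
                                        _mm_load_ps(fb + i)));
        } else {
            /* Unaligned loads and stores, four floats per iteration. */
            for (; i + 4 <= n; i += 4)
                _mm_storeu_ps(fo + i,
                              _mm_add_ps(_mm_loadu_ps(fa + i),
                                         _mm_loadu_ps(fb + i)));
        }
    }
#endif

    /* Final special case: generic strided loop. It also processes the
     * leftover tail (fewer than four elements) of the SSE paths. */
    for (; i < n; i++) {
        const float x = *(const float *)(a + (ptrdiff_t)i * sa);
        const float y = *(const float *)(b + (ptrdiff_t)i * sb);
        *(float *)(out + (ptrdiff_t)i * so) = x + y;
    }
}
```

Note that a sketch like this only addresses the instruction-level part of the problem. For operands that do not fit into the CPU cache, the loop is still limited by memory transfer, as described in point 2 above.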