Nathaniel Smith <n...@pobox.com> wrote: > I unfortunately don't have the skills to actually lead such an effort > (I've never written a line of asm in my life...), but surely our > collective communities have people who do?
The assembly part in OpenBLAS/GotoBLAS is the major problem. It is not just that it is AT&T syntax (i.e. it requires MinGW to build on Windows), but also that it supports a wide range of processors. We just need a fast BLAS we can use in Windows binary wheels (and possibly on Mac OS X). There is no need to support anything other than x86 and AMD64 architectures.

So in theory one could throw out all the assembly and rewrite the kernels with compiler intrinsics for the various SIMD architectures. Or one could just rely on the compiler to auto-vectorize, and simply program the code so it is easily vectorized: if we manually unroll loops properly, and make sure the compiler is hinted about memory alignment and pointer aliasing, the compiler will know what to do. There is already a reference BLAS implementation at Netlib, which we could translate to C and optimize for SIMD.

Then we need a fast threadpool. I have one I can donate, or we could use libxdispatch (a port of Apple's libdispatch, aka GCD, to Windows and Linux). Even Intel could not make their TBB perform better than libdispatch.

And that's about what we need. Or we could start with OpenBLAS and throw away everything we don't need. Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.

Sturla

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
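The auto-vectorization approach described above could be sketched roughly as follows. This is a hypothetical daxpy-style kernel, not code from any existing BLAS: `restrict` tells the compiler the input and output arrays do not alias, and the unit-stride, manually unrolled loop is the kind of shape mainstream compilers can vectorize without any hand-written assembly.

```c
#include <stddef.h>

/* Hypothetical sketch: a daxpy-style kernel (y += alpha * x) written
 * so the compiler can auto-vectorize it.  "restrict" promises that x
 * and y do not alias, and the simple unit-stride loop gives the
 * vectorizer an easy target.  Names here are illustrative only. */
static void daxpy_kernel(size_t n, double alpha,
                         const double *restrict x,
                         double *restrict y)
{
    size_t i;

    /* Main loop unrolled by 4 to encourage wider SIMD scheduling. */
    for (i = 0; i + 4 <= n; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }

    /* Scalar cleanup for the remaining 0-3 elements. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

Whether the compiler actually emits SIMD code for this still depends on flags (e.g. `-O3` with GCC, `/O2` with MSVC), which is part of why hinting alignment and aliasing matters.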
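As for the threading layer, the pattern is simply to split a BLAS operation into independent blocks and hand them to workers. Here is a minimal sketch of that idea for a matrix-vector product, using raw pthreads; all names are made up for illustration. A real implementation would submit the blocks to a persistent pool (or a libdispatch queue) rather than spawning threads per call.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch: parallelize y = A*x by splitting the rows of a
 * row-major m x n matrix across workers.  This spawns threads per
 * call for simplicity; a real BLAS would reuse a persistent pool. */
typedef struct {
    size_t row_begin, row_end, n;
    const double *a;   /* row-major m x n matrix */
    const double *x;
    double *y;
} gemv_task;

static void *gemv_worker(void *arg)
{
    gemv_task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < t->n; j++)
            acc += t->a[i * t->n + j] * t->x[j];
        t->y[i] = acc;
    }
    return NULL;
}

/* Launch up to nthreads workers over row blocks; joins before
 * returning, so y is complete when the call ends. */
static void gemv_parallel(size_t m, size_t n, const double *a,
                          const double *x, double *y, size_t nthreads)
{
    pthread_t tids[8];
    gemv_task tasks[8];
    if (nthreads > 8) nthreads = 8;
    size_t chunk = (m + nthreads - 1) / nthreads;
    size_t used = 0;
    for (size_t k = 0; k < nthreads; k++) {
        size_t b = k * chunk, e = b + chunk;
        if (b >= m) break;
        if (e > m) e = m;
        tasks[k] = (gemv_task){ b, e, n, a, x, y };
        pthread_create(&tids[k], NULL, gemv_worker, &tasks[k]);
        used++;
    }
    for (size_t k = 0; k < used; k++)
        pthread_join(tids[k], NULL);
}
```

The row blocks write to disjoint parts of y, so no locking is needed; that independence is exactly what makes a work-queue design like libdispatch a good fit here.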