Yes, optimized BLAS implementations like MKL and OpenBLAS use vectorization heavily.
Note that matrix addition A+B is fundamentally a very different beast from matrix multiplication A*B. In the former you have O(N^2) work and O(N^2) data, so the ratio of work to data is O(1). The operation is very likely memory bound, in which case there is little to gain from optimizing the computations. In the latter you have O(N^3) work and O(N^2) data, so the ratio of work to data is O(N). There is a good chance the operation is compute bound, and so it pays off to optimize the computations.

Thanks,

Jiahao Chen
Research Scientist, Julia Lab
[jiahao.github.io](http://jiahao.github.io)

On Apr 16 2016, at 11:13 pm, Chris Rackauckas <rackd...@gmail.com> wrote:

> BLAS functions are painstakingly developed to be beautiful bastions of parallelism (because of how ubiquitous their use is). The closest I think you can get is ParallelAccelerator.jl's @acc, which applies a lot of optimizations all together. However, it still won't match BLAS in efficiency, since BLAS is just really well optimized by hand. But give ParallelAccelerator a try; it's a great tool for getting things to run fast with little work.
>
> On Saturday, April 16, 2016 at 4:50:50 PM UTC-7, Jason Eckstein wrote:
>
>> I often use Julia's multicore features with pmap and @parallel for loops. So the best way to achieve this is to split the array up into parts for each core and then run SIMD loops on each parallel process? Will there ever be a time when you can add a tag like @simd and have the compiler do this automatically, the way it happens for BLAS functions?
>>
>> On Saturday, April 16, 2016 at 3:26:22 AM UTC-6, Valentin Churavy wrote:
>>
>>> BLAS uses a combination of SIMD and multi-core processing. Multi-core (threading) is coming in Julia v0.5 as an experimental feature.
>>>
>>> On Saturday, 16 April 2016 14:13:00 UTC+9, Jason Eckstein wrote:
>>>
>>>> I noticed that in Julia 0.4, if you call A+B where A and B are matrices of equal size, the LLVM code shows vectorization, indicating it is equivalent to writing my own function with an @simd-tagged for loop. I still notice, though, that it uses a single core to maximum capacity but never spreads an SIMD loop out over multiple cores. In contrast, if I use BLAS functions like gemm!, or even just A*B, it will use every core of the processor. I'm not sure whether these linear algebra operations also use SIMD vectorization, but I imagine they do since BLAS is very optimized. Is there a way to write an SIMD loop that spreads the data out across all processor cores, not just the multiple functional units of a single core?
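
A rough way to see Jiahao's memory-bound vs compute-bound point in practice (a minimal sketch, not from the thread; the matrix size and the simple `@elapsed` timing are arbitrary choices):

```julia
# A+B does ~N^2 flops on O(N^2) data, while A*B does ~2N^3 flops on the
# same amount of data, so A*B can sustain a much higher flop rate before
# it runs into the memory bandwidth wall.
N = 2000
A, B = rand(N, N), rand(N, N)

A + B; A * B  # warm up the JIT before timing

t_add = @elapsed A + B
t_mul = @elapsed A * B

println("A+B: ", N^2 / t_add / 1e9, " Gflop/s")
println("A*B: ", 2N^3 / t_mul / 1e9, " Gflop/s")
```

On a typical machine the multiply sustains a far higher flop rate even though both operations touch the same arrays, which is exactly the distinction Jiahao describes.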
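
For reference, a hedged sketch of the ParallelAccelerator.jl usage Chris mentions; the `@acc` macro and the kinds of expressions it accelerates are taken from the package's documentation of that era and may have changed, so check the README:

```julia
using ParallelAccelerator  # assumes the package is installed

# @acc compiles the annotated function with the package's parallelizing
# optimizations; map-style array expressions are its sweet spot.
@acc function accadd(A, B)
    return A .+ B
end

A, B = rand(1000, 1000), rand(1000, 1000)
accadd(A, B)
```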
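
And a minimal sketch of the "split across cores, SIMD within each core" pattern Jason asks about, using the experimental threading Valentin refers to (Julia v0.5's `Threads.@threads`; start Julia with `JULIA_NUM_THREADS` set to your core count). The function name is hypothetical:

```julia
function threaded_add!(C, A, B)
    Threads.@threads for j in 1:size(A, 2)      # columns split across threads
        @inbounds @simd for i in 1:size(A, 1)   # SIMD within each column
            C[i, j] = A[i, j] + B[i, j]
        end
    end
    return C
end

A, B = rand(4000, 4000), rand(4000, 4000)
C = similar(A)
threaded_add!(C, A, B)
```

Threading over columns keeps each thread's inner loop on contiguous column-major memory, which is what lets `@simd` vectorize it. That said, for an operation as memory bound as matrix addition, extra cores may add little, per the point at the top of the thread.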
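
Finally, a quick way to confirm the vectorization the original post observes (again a sketch; `vadd!` is a hypothetical helper): inspect the LLVM IR and look for vector types such as `<4 x double>`:

```julia
function vadd!(c, a, b)
    @inbounds @simd for i in eachindex(c)
        c[i] = a[i] + b[i]
    end
    return c
end

a, b, c = rand(1000), rand(1000), zeros(1000)
@code_llvm vadd!(c, a, b)  # vectorized loops show <4 x double> etc.
```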