Yes, optimized BLAS implementations like MKL and OpenBLAS use vectorization.


Note that matrix addition A+B is fundamentally a very different beast from
matrix multiplication A*B. In the former you have O(N^2) work and O(N^2) data,
so the ratio of work to data is O(1). The operation is therefore very likely
memory bound, in which case there is little to gain from optimizing the
computations. In the latter you have O(N^3) work and O(N^2) data, so the ratio
of work to data is O(N). The operation has a good chance of being compute
bound, so optimizing the computations pays off.
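
To make the ratio concrete, here is a rough back-of-the-envelope sketch in Julia. It assumes dense N×N Float64 matrices and an idealized model in which each matrix crosses the memory bus exactly once; the helper names (add_flops, mul_bytes, intensity, and so on) are made up for this illustration and are not part of any library.

```julia
# Idealized arithmetic intensity (flops per byte moved) for N×N Float64 matrices.
# Assumption: every matrix is read or written exactly once (no cache effects).
const BYTES_PER_ELEMENT = sizeof(Float64)   # 8 bytes

# A + B: N^2 additions; read A and B, write the result (3 N^2 elements).
add_flops(N) = N^2
add_bytes(N) = 3 * N^2 * BYTES_PER_ELEMENT

# A * B: about N^3 multiplications and N^3 additions, so roughly 2 N^3 flops,
# over the same 3 N^2 elements of data.
mul_flops(N) = 2 * N^3
mul_bytes(N) = 3 * N^2 * BYTES_PER_ELEMENT

# Ratio of work to data for a given pair of counting functions.
intensity(flops, bytes, N) = flops(N) / bytes(N)

for N in (100, 1_000, 10_000)
    println("N = ", N,
            "  add: ", intensity(add_flops, add_bytes, N), " flop/byte",
            "  mul: ", intensity(mul_flops, mul_bytes, N), " flop/byte")
end
```

Under this model the intensity of addition stays fixed at 1/24 flop per byte regardless of N, while that of multiplication grows as N/12, which is the O(1) versus O(N) distinction above.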




Jiahao Chen

Research Scientist

Julia Lab



On Apr 16 2016, at 11:13 pm, Chris Rackauckas wrote:

> BLAS functions are painstakingly developed to be beautiful bastions of
parallelism (because of how ubiquitous their use is). The closest I think you
can get is ParallelAccelerator.jl's @acc which does a lot of optimizations all
together. However, it still won't match BLAS in terms of its efficiency since
BLAS is just really well optimized by hand. But give ParallelAccelerator a
try, it's a great tool for getting things to run fast with little work.  
On Saturday, April 16, 2016 at 4:50:50 PM UTC-7, Jason Eckstein wrote:


>> I often use Julia multicore features with pmap and @parallel for loops.  So
the best way to achieve this is to split the array up into parts for each core
and then run SIMD loops on each parallel process?  Will there ever be a time
when you can add a tag like SIMD that will have the compiler automatically
do this like it does for BLAS functions?  
On Saturday, April 16, 2016 at 3:26:22 AM UTC-6, Valentin Churavy wrote:


>>> BLAS is using a combination of SIMD and multi-core processing. Multi-core
(threading) is coming in Julia v0.5 as an experimental feature.  
On Saturday, 16 April 2016 14:13:00 UTC+9, Jason Eckstein wrote:


>>>> I noticed in Julia 0.4 now if you call A+B where A and B are matrices of
equal size, the LLVM code shows vectorization, indicating it is equivalent to
if I wrote my own function with an @simd-tagged for loop.  I still notice
though that it uses a single core to maximum capacity but never spreads an
SIMD loop out over multiple cores.  In contrast, if I use BLAS functions like
gemm! or even just A*B it will use every core of the processor.  I'm not sure
if these linear algebra operations also use SIMD vectorization but I imagine
they do since BLAS is very optimized.  Is there a way to write an SIMD loop
that spreads the data out across all processor cores, not just the multiple
functional units of a single core?
