> Is that because Julia is calling a precompiled library and doesn't
directly see the byte code?
Yes
There's also a BLAS operation for a*X + Y, which is axpy!(a, X, Y). I tried
it with the following lines

a = 2.0f0
X = rand(Float32, 5000, 5000)
Y = rand(Float32, 5000, 5000)
for i = 1:100 axpy!(a, X, Y) end

in a normal interactive session and noticed that all the cores were in use,
near 100% CPU utilization.
Yes, optimized BLAS implementations like MKL and OpenBLAS use vectorization
heavily.
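
One way to see how much of that comes from threading versus SIMD is to pin
BLAS to a single thread and time the call again. A rough sketch, reusing the
X, Y, and a from above (in 0.4/0.5 set_num_threads lives under
Base.LinAlg.BLAS; in later versions it is LinearAlgebra.BLAS):

Base.LinAlg.BLAS.set_num_threads(1)
@time axpy!(a, X, Y)                 # one BLAS thread: remaining speed comes from SIMD
Base.LinAlg.BLAS.set_num_threads(4)  # or however many physical cores you have
@time axpy!(a, X, Y)                 # SIMD plus threading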
Note that matrix addition A+B is fundamentally a very different beast from
matrix multiplication A*B. In the former you have O(N^2) work and O(N^2) data,
so the ratio of work to data is O(1), and the operation is very likely limited
by memory bandwidth rather than by arithmetic. Matrix multiplication, by
contrast, does O(N^3) work on O(N^2) data, which is why SIMD and threading pay
off so much more there.
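
A quick way to see this is to time both operations on the same matrices
(a rough sketch; the exact numbers will depend on your machine and BLAS):

N = 5000
A = rand(Float32, N, N)
B = rand(Float32, N, N)
A + B; A * B   # warm up / compile

@time A + B    # O(N^2) flops on O(N^2) data: limited by memory bandwidth
@time A * B    # O(N^3) flops on O(N^2) data: compute-bound, where BLAS shines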
BLAS functions are painstakingly developed to be beautiful bastions of
parallelism (because of how ubiquitous their use is). The closest I think
you can get is ParallelAccelerator.jl's @acc, which applies a lot of
optimizations all together. However, it still won't match BLAS in terms of
its efficiency.
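
For example, something like this (a sketch assuming ParallelAccelerator's
@acc macro applied to a whole function over array-style expressions; I
haven't benchmarked it against BLAS):

using ParallelAccelerator

# @acc compiles the body with ParallelAccelerator's parallelization and fusion
@acc function saxpy(a, X, Y)
    return a .* X .+ Y
end

X = rand(Float32, 5000, 5000)
Y = rand(Float32, 5000, 5000)
Z = saxpy(2.0f0, X, Y)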
I often use Julia's multicore features with pmap and @parallel for loops. So
the best way to achieve this is to split the array up into parts for each
core and then run SIMD loops on each parallel process? Something like the
sketch below is what I have in mind. Will there ever be a time when you can
add a tag like @simd that will have the compiler automatically parallelize a
loop across cores as well?
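
(A rough sketch of what I mean, using SharedArray and the current @parallel
syntax; the scalar a and the column-wise split are just illustrative:)

addprocs(3)                        # one worker per extra core

a = 2.0f0
X = SharedArray(Float32, (5000, 5000))
Y = SharedArray(Float32, (5000, 5000))

@sync @parallel for j = 1:size(X, 2)   # columns split across workers
    @simd for i = 1:size(X, 1)         # SIMD inner loop on each worker
        @inbounds Y[i, j] += a * X[i, j]
    end
end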
BLAS is using a combination of SIMD and multi-core processing. Multi-core
(threading) is coming in Julia v0.5 as an experimental feature.
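
As a rough sketch of what that might look like (assuming the experimental
Threads.@threads macro planned for v0.5, with Julia started with the
JULIA_NUM_THREADS environment variable set):

# run with e.g. JULIA_NUM_THREADS=4
function threaded_axpy!(a, X, Y)
    Threads.@threads for j = 1:size(X, 2)   # columns split across threads
        @simd for i = 1:size(X, 1)          # SIMD within each thread
            @inbounds Y[i, j] += a * X[i, j]
        end
    end
    return Y
end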
On Saturday, 16 April 2016 14:13:00 UTC+9, Jason Eckstein wrote:
>
> I noticed in Julia 0.4 now if you call A+B where A and B are matrices of
> equal size, the