There's also a BLAS operation for a*X + Y, which is axpy!(a, X, Y). I tried it in a normal interactive session with the following lines:

X = rand(Float32, 5000, 5000)
Y = rand(Float32, 5000, 5000)
a = 2.0f0  # the scalar was left undefined in the original snippet
for i = 1:100
    axpy!(a, X, Y)
end

While the loop ran I noticed that all the cores were in use, near 100% CPU utilization, so even for matrix addition BLAS uses multiple threads and SIMD.
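To check that the speedup really comes from BLAS threading, something like this should work (a rough sketch; set_num_threads lives in Base.LinAlg.BLAS on 0.4/0.5, older releases spell it Base.blas_set_num_threads, and Sys.CPU_CORES became Sys.CPU_THREADS in later versions):

import Base.LinAlg.BLAS: set_num_threads, axpy!

X = rand(Float32, 5000, 5000)
Y = rand(Float32, 5000, 5000)
a = 2.0f0

set_num_threads(1)               # pin BLAS to a single thread
@time for i = 1:100; axpy!(a, X, Y); end

set_num_threads(Sys.CPU_CORES)   # let BLAS use every core again
@time for i = 1:100; axpy!(a, X, Y); end

If the first timing is several times slower, the extra cores are doing real work rather than just showing utilization.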
For that reason I think any SIMD for loop that applies a single simple function to an array would benefit from a similar split of the data across the cores. Also, when I use @code_llvm on BLAS operations I never see vectorized instructions in the output, but I do for native Julia functions like X + Y. Is that because Julia is calling into a precompiled library, so there is no LLVM IR for it to show?

On Saturday, April 16, 2016 at 9:22:15 PM UTC-6, Jiahao Chen wrote:
>
> Yes, optimized BLAS implementations like MKL and OpenBLAS use
> vectorization heavily.
>
> Note that matrix addition A+B is fundamentally a very different beast
> from matrix multiplication A*B. In the former you have O(N^2) work and
> O(N^2) data, so the ratio of work to data is O(1). It is very likely
> that the operation is memory bound, in which case there is little to
> gain from optimizing the computations. In the latter you have O(N^3)
> work and O(N^2) data, so the ratio of work to data is O(N). There is a
> good possibility that the operation is compute bound, and so there is a
> payoff to optimizing such computations.
>
> Thanks,
>
> Jiahao Chen
> Research Scientist
> Julia Lab
> jiahao.github.io
>
> On Apr 16 2016, at 11:13 pm, Chris Rackauckas <rack...@gmail.com> wrote:
>
>> BLAS functions are painstakingly developed to be beautiful bastions of
>> parallelism (because of how ubiquitous their use is). The closest I
>> think you can get is ParallelAccelerator.jl's @acc, which does a lot of
>> optimizations all together. However, it still won't match BLAS in terms
>> of efficiency, since BLAS is just really well optimized by hand. But
>> give ParallelAccelerator a try; it's a great tool for getting things to
>> run fast with little work.
>>
>> On Saturday, April 16, 2016 at 4:50:50 PM UTC-7, Jason Eckstein wrote:
>>
>> I often use Julia's multicore features with pmap and @parallel for
>> loops. So is the best way to achieve this to split the array up into
>> parts for each core and then run SIMD loops on each parallel process?
>> Will there ever be a time when you can add a tag like @simd that has
>> the compiler do this automatically, the way it happens for BLAS
>> functions?
>>
>> On Saturday, April 16, 2016 at 3:26:22 AM UTC-6, Valentin Churavy wrote:
>>
>> BLAS uses a combination of SIMD and multi-core processing. Multi-core
>> (threading) is coming in Julia v0.5 as an experimental feature.
>>
>> On Saturday, 16 April 2016 14:13:00 UTC+9, Jason Eckstein wrote:
>>
>> I noticed in Julia 0.4 that if you call A+B, where A and B are matrices
>> of equal size, the LLVM code shows vectorization, indicating it is
>> equivalent to writing my own function with an @simd-tagged for loop. I
>> still notice, though, that it uses a single core to maximum capacity
>> but never spreads a SIMD loop out over multiple cores. In contrast, if
>> I use BLAS functions like gemm!, or even just A*B, it will use every
>> core of the processor. I'm not sure whether these linear algebra
>> operations also use SIMD vectorization, but I imagine they do, since
>> BLAS is very optimized. Is there a way to write a SIMD loop that
>> spreads the data out across all processor cores, not just the multiple
>> functional units of a single core?
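To put rough numbers on Jiahao's work-to-data ratio for the N = 5000 case above (my own back-of-envelope arithmetic, not from his post):

N = 5000
add_flops = N^2        # A + B: one add per output element
mul_flops = 2 * N^3    # A * B: a multiply and an add per inner-product term
data = 2 * N^2         # elements read from A and B
add_flops / data       # 0.5 flops per element read: memory bound
mul_flops / data       # N = 5000 flops per element read: compute bound

At Float32, the two 5000x5000 operands alone are about 200 MB, so for A + B the time is dominated by moving that memory, no matter how clever the arithmetic is.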
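For what it's worth, here is a sketch of how the v0.5 experimental threading Valentin mentions could be combined with @simd; the function name threaded_axpy! and the manual chunking are my own illustration, and it assumes Julia was started with JULIA_NUM_THREADS set:

# Each thread takes one contiguous chunk so its inner loop can vectorize.
function threaded_axpy!(a, X, Y)
    n = length(Y)
    nt = Threads.nthreads()
    Threads.@threads for t = 1:nt
        lo = div((t - 1) * n, nt) + 1   # chunk boundaries for thread t
        hi = div(t * n, nt)
        @inbounds @simd for j = lo:hi
            Y[j] += a * X[j]
        end
    end
    return Y
end

threaded_axpy!(2.0f0, rand(Float32, 10^6), rand(Float32, 10^6))

This is exactly the "split the array into parts for each core, then run SIMD loops on each part" pattern from the thread, just with threads instead of processes.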
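On 0.4, the process-based version Jason asks about can be sketched with a SharedArray and @parallel (again my own illustration; start with julia -p 4 or similar, and the SharedArray constructor below uses the 0.4-era syntax):

# All workers see the same memory through a SharedArray, so nothing is copied.
X = SharedArray(Float32, 5000, 5000)
Y = SharedArray(Float32, 5000, 5000)
X[:] = rand(Float32, 5000, 5000)
Y[:] = rand(Float32, 5000, 5000)
a = 2.0f0

# @parallel splits the column range across worker processes, and each
# worker runs a SIMD-friendly inner loop down its own columns.
@sync @parallel for col = 1:size(Y, 2)
    @inbounds @simd for row = 1:size(Y, 1)
        Y[row, col] += a * X[row, col]
    end
end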
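As for Chris's ParallelAccelerator suggestion, its README applies @acc to whole function definitions, so trying it here would look roughly like this (unverified sketch from memory of the package's documentation):

using ParallelAccelerator

# @acc recompiles the function, parallelizing array operations like .+
@acc function accel_add(X, Y)
    return X .+ Y
end

accel_add(rand(Float32, 5000, 5000), rand(Float32, 5000, 5000))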