There's also a BLAS operation for a*X + Y, which is axpy!(a, X, Y).  I tried 
it in a normal interactive session with the following lines:
import Base.LinAlg.BLAS: axpy!   # axpy! is not exported by default

a = 2.0f0                        # example scalar; Float32 to match X and Y
X = rand(Float32, 5000, 5000)
Y = rand(Float32, 5000, 5000)
for i = 1:100 axpy!(a, X, Y) end

All the cores were in use, near 100% CPU utilization, so even for matrix 
addition BLAS uses multiple cores in parallel as well as SIMD.  For that 
reason I think any @simd for loop that applies a single simple function to an 
array would benefit from a similar split of the data across the cores.  Also, 
when I use @code_llvm on BLAS operations I never see the vectorized 
instructions in the output, but I do for native Julia operations like X + Y.  
Is that because Julia is just calling into a precompiled library and never 
sees its compiled code?
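
For reference, a hand-written kernel to compare against might look like this 
(my_axpy! is just an illustrative name, not anything from Base):

function my_axpy!(a, X, Y)
    @inbounds @simd for i in eachindex(X)
        Y[i] += a * X[i]
    end
    return Y
end

# The native @simd loop shows vector types like <8 x float> in the IR
# (the width depends on the hardware); axpy! itself is only a thin wrapper
# around a ccall into the BLAS library, so its kernels never show up here.
@code_llvm my_axpy!(2.0f0, rand(Float32, 100), rand(Float32, 100))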

On Saturday, April 16, 2016 at 9:22:15 PM UTC-6, Jiahao Chen wrote:
>
> Yes, optimized BLAS implementations like MKL and OpenBLAS use 
> vectorization heavily.
>
> Note that matrix addition A+B is fundamentally a very different beast from 
> matrix multiplication A*B. In the former you have O(N^2) work and O(N^2) 
> data, so the ratio of work to data is O(1). It is very likely that the 
> operation is memory bound, in which case there is little to gain from 
> optimizing the computations. In the latter you have O(N^3) work and O(N^2) 
> data, so the ratio of work to data is O(N). There exists a good possibility 
> for the operation to be compute bound, and so there is a payoff to optimize 
> such computations.
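>
> To put rough numbers on that for the 5000x5000 Float32 matrices mentioned 
> above: A+B performs N^2 = 2.5e7 additions while moving roughly 3*N^2*4 bytes 
> = 300 MB through memory, i.e. about 12 bytes of traffic per flop, so the 
> memory bus is the bottleneck.  A*B performs about 2*N^3 = 2.5e11 flops on 
> the same 300 MB, around 0.001 bytes per flop, which leaves plenty of 
> arithmetic to keep every core and its vector units busy.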
>
> Thanks,
>
> Jiahao Chen
> Research Scientist
> Julia Lab
> jiahao.github.io
>
> On Apr 16 2016, at 11:13 pm, Chris Rackauckas <rack...@gmail.com> wrote: 
>
>> BLAS functions are painstakingly developed to be beautiful bastions of 
>> parallelism (because of how ubiquitous their use is).  The closest I think 
>> you can get is ParallelAccelerator.jl's @acc macro, which applies a lot of 
>> optimizations together.  However, it still won't match BLAS in efficiency, 
>> since BLAS is just really well optimized by hand.  But give 
>> ParallelAccelerator a try; it's a great tool for getting things to run fast 
>> with little work.
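>>
>> A minimal sketch of the usage, assuming ParallelAccelerator is installed 
>> (the function name and body here are only an illustration):
>>
>> using ParallelAccelerator
>>
>> # @acc asks ParallelAccelerator to compile this function with its extra
>> # optimizations (parallelization, fusion, etc.) where it can apply them.
>> @acc function scale_add(a, X, Y)
>>     Y .+ a .* X
>> end
>>
>> Z = scale_add(2.0f0, rand(Float32, 2000, 2000), rand(Float32, 2000, 2000))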
>>
>> On Saturday, April 16, 2016 at 4:50:50 PM UTC-7, Jason Eckstein wrote:
>>
>> I often use Julia multicore features with pmap and @parallel for loops. 
>> So the best way to achieve this is to split the array up into parts for 
>> each core and then run SIMD loops on each parallel process?  Will there 
>> ever be a time when you can add a tag like @simd that has the compiler do 
>> this automatically, the way it happens for BLAS functions?
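>>
>> Something like this is what I have in mind (a rough sketch only, assuming 
>> workers were added with addprocs() and the data lives in SharedArrays so 
>> every worker can see it):
>>
>> a = 2.0f0
>> X = SharedArray(Float32, (5000, 5000)); rand!(X)
>> Y = SharedArray(Float32, (5000, 5000)); rand!(Y)
>>
>> # @parallel hands each worker a block of columns; the inner loop over the
>> # rows of that block is an ordinary @simd loop running on that worker.
>> @sync @parallel for j = 1:size(X, 2)
>>     @inbounds @simd for i = 1:size(X, 1)
>>         Y[i, j] += a * X[i, j]
>>     end
>> end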
>>
>> On Saturday, April 16, 2016 at 3:26:22 AM UTC-6, Valentin Churavy wrote:
>>
>> BLAS uses a combination of SIMD and multi-core processing.  Multi-core 
>> (threading) is coming in Julia v0.5 as an experimental feature. 
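>>
>> As a rough sketch, the same column-split-plus-@simd pattern as in the sketch 
>> above, but with threads inside a single process (experimental, so the exact 
>> API may still change; Julia has to be started with the JULIA_NUM_THREADS 
>> environment variable set):
>>
>> function threaded_axpy!(a, X, Y)
>>     Threads.@threads for j = 1:size(X, 2)      # columns split across threads
>>         @inbounds @simd for i = 1:size(X, 1)   # SIMD within each column
>>             Y[i, j] += a * X[i, j]
>>         end
>>     end
>>     return Y
>> end
>>
>> threaded_axpy!(2.0f0, rand(Float32, 5000, 5000), rand(Float32, 5000, 5000))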
>>
>> On Saturday, 16 April 2016 14:13:00 UTC+9, Jason Eckstein wrote:
>>
>> I noticed that in Julia 0.4, if you call A+B where A and B are matrices of 
>> equal size, the LLVM code shows vectorization, indicating it is equivalent 
>> to writing my own function with an @simd-tagged for loop.  I still notice, 
>> though, that it uses a single core to maximum capacity but never spreads an 
>> SIMD loop out over multiple cores.  In contrast, if I use BLAS functions 
>> like gemm! or even just A*B, it will use every core of the processor.  I'm 
>> not sure if these linear algebra operations also use SIMD vectorization, 
>> but I imagine they do since BLAS is very optimized.  Is there a way to 
>> write an SIMD loop that spreads the data out across all processor cores, 
>> not just the multiple functional units of a single core?
>>
>>
