I get similar results with OpenBLAS. I'd expect axpy to gain more from vectorization than dot: axpy's iterations are independent, while dot is a reduction whose accumulator creates a loop-carried dependency unless the kernel keeps several partial sums.
On Fri, Sep 9, 2016 at 5:31 PM, Sheehan Olver <dlfivefi...@gmail.com> wrote:
> I did blas_set_num_threads(1) with the same profile numbers. This is
> using Apple's BLAS.
>
> Maybe I'll try 0.5 and OpenBLAS for comparison.
>
> On 10 Sep 2016, at 2:34 AM, Andreas Noack <andreasnoackjen...@gmail.com> wrote:
>
> Try to time it again with threading disabled. Sometimes the threading
> heuristics can cause unintuitive performance.
>
> On Friday, September 9, 2016 at 6:39:13 AM UTC-4, Sheehan Olver wrote:
>>
>> I have the following code that is part of a Householder routine, where
>> j::Int64, N::Int64, R.cols::Vector{Int64}, wp::Ptr{Float64}, M::Int64,
>> v::Ptr{Float64}:
>>
>> …
>> for j=k:N
>>     v=r+(R.cols[j]+k-2)*sz
>>     dt=BLAS.dot(M,wp,1,v,1)
>>     BLAS.axpy!(M,-2*dt,wp,1,v,1)
>> end
>> …
>>
>> For some reason, the BLAS.dot call takes 3x as long as the BLAS.axpy!
>> call. Is this expected, or is there something wrong?
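To separate the two calls from the surrounding Householder loop, here is a minimal timing sketch. It uses the current LinearAlgebra API (`BLAS.set_num_threads` rather than the 0.4-era `blas_set_num_threads`) and plain `Vector{Float64}` instead of the raw pointers in the original code; the vector length and repeat count are arbitrary choices for illustration.

```julia
# Sketch: time BLAS.dot vs BLAS.axpy! in isolation on single-threaded BLAS.
# Sizes here are hypothetical; the unit strides mirror the original loop.
using LinearAlgebra

M = 200          # length per call (arbitrary)
reps = 100_000   # repetitions so each timing is measurable
w = rand(M)
v = rand(M)

BLAS.set_num_threads(1)  # rule out threading heuristics, as suggested above

# Warm up both calls once so compilation is excluded from the timings.
dt = BLAS.dot(M, w, 1, v, 1)
BLAS.axpy!(-2dt, w, v)

# dot is a reduction: each step feeds the running sum, so it is
# latency-bound unless the kernel keeps multiple accumulators.
t_dot = @elapsed for _ in 1:reps
    BLAS.dot(M, w, 1, v, 1)
end

# axpy! has no cross-iteration dependency, so it vectorizes freely
# (it tends to be bound by stores instead).
t_axpy = @elapsed for _ in 1:reps
    BLAS.axpy!(-2dt, w, v)
end

println("dot:  ", t_dot / reps, " s/call")
println("axpy: ", t_axpy / reps, " s/call")
```

Comparing the per-call ratio from a loop like this against the profile numbers should show whether the 3x gap is intrinsic to the two kernels or an artifact of how the profiler attributes time inside the Householder loop.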