You are 100% right, Andreas. Using ArrayViews seems to work, and now the code is 2.5x faster than my Matlab code! Any obvious problems with this?
function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
    deltas.d[:] = 0.
    for ti=1:size(w,3), ti2=1:size(d,3)
        Base.LinAlg.BLAS.gemm!('T', 'N', one(Float32), view(w,:,:,ti),
                               view(d,:,:,ti2), one(Float32),
                               view(deltas.d,:,:,ti+ti2-1))
    end
    deltas.d
end

Still around 30% gc time, so I'll have to do some more profiling to determine the source.

On Sunday, September 14, 2014 12:29:11 PM UTC-7, Andreas Noack wrote:
>
> I don't think this version does what you want it to do.
> deltas.d[:,:,ti+ti2-1] makes a copy, so deltas.d will be unmodified by
> gemm!. You can use sub as Viral mentions, or you could try ArrayViews.jl.
> Another thing you could consider is to use a Vector{Matrix{Float32}}
> instead of an Array{Float32,3}. It can be slightly unintuitive, but if you
> index a Vector{Matrix{Float32}} no copy is made.
>
> Kind regards
>
> Andreas Noack
>
> 2014-09-14 15:08 GMT-04:00 Michael Oliver <michael....@gmail.com>:
>
>> I was using axpy! to replace the += only and doing the matrix multiply in
>> the argument to axpy!. But you're right, gemm! is actually what I should be
>> using (I'm just starting to learn the BLAS routine names). Using gemm! the
>> code is now 1.68x faster than my Matlab code (I mean a whole epoch of
>> backprop training)! And down to 40% gc time. My goal of a 2x speedup is in
>> sight! I'll look into SubArrays next.
>>
>> function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
>>     deltas.d[:] = 0.
>>     for ti=1:size(w,3), ti2=1:size(d,3)
>>         Base.LinAlg.BLAS.gemm!('T', 'N', one(Float32), w[:,:,ti],
>>                                d[:,:,ti2], one(Float32),
>>                                deltas.d[:,:,ti+ti2-1])
>>     end
>>     deltas.d
>> end
>>
>> On Sunday, September 14, 2014 2:18:07 AM UTC-7, Viral Shah wrote:
>>>
>>> Oh never mind - I see that you have a matrix multiply there that
>>> benefits from calling BLAS. If it is a matrix multiply, how come you can
>>> get away with axpy? Shouldn't you need a gemm?
>>>
>>> Another way to avoid creating temporary arrays with indexing is to use
>>> SubArrays, which the linear algebra routines can work with.
>>>
>>> -viral
>>>
>>> > On 14-Sep-2014, at 2:43 pm, Viral Shah <vi...@mayin.org> wrote:
>>> >
>>> > That is great! However, by devectorizing, I meant writing the matrix
>>> > multiply itself out as explicit loops nested inside the ti/ti2 loop.
>>> > You basically do not want all those w[:,:,ti] calls that create
>>> > matrices every time.
>>> >
>>> > You could also potentially hoist the deltas.d out of the loop. Try
>>> > something like:
>>> >
>>> > function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
>>> >     deltas.d[:] = 0.
>>> >     dd = deltas.d
>>> >     for ti=1:size(w,3), ti2=1:size(d,3)
>>> >         for j=1:size(d,2), i=1:size(w,2), k=1:size(w,1)
>>> >             dd[i,j,ti+ti2-1] += w[k,i,ti]*d[k,j,ti2]
>>> >         end
>>> >     end
>>> >     deltas.d
>>> > end
>>> >
>>> > -viral
>>> >
>>> >> On 14-Sep-2014, at 12:47 pm, Michael Oliver <michael....@gmail.com> wrote:
>>> >>
>>> >> Thanks Viral for the quick reply, that's good to know. I was able to
>>> >> squeeze a little more performance out with axpy! (see below). I tried
>>> >> devectorizing the inner loop, but it was much slower, I believe because
>>> >> it was no longer taking full advantage of MKL for the matrix multiply.
>>> >> So far I've got the code running at 1.4x what I had in Matlab, and
>>> >> according to @time I still have 44.41% gc time. So 0.4 can't come soon
>>> >> enough! Great work guys, I'm really enjoying learning Julia.
>>> >>
>>> >> function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
>>> >>     deltas.d[:] = 0.
>>> >>     rg = size(w,2)*size(d,2)
>>> >>     for ti=1:size(w,3), ti2=1:size(d,3)
>>> >>         Base.LinAlg.BLAS.axpy!(1, w[:,:,ti]'*d[:,:,ti2], range(1,rg),
>>> >>                                deltas.d[:,:,ti+ti2-1], range(1,rg))
>>> >>     end
>>> >>     deltas.d
>>> >> end
>>> >>
>>> >> On Saturday, September 13, 2014 10:10:25 PM UTC-7, Viral Shah wrote:
>>> >>
>>> >> The garbage is generated from the indexing operations. In 0.4, we
>>> >> should have array views that should solve this problem. For now, you
>>> >> can either manually devectorize the inner loop, or use the @devectorize
>>> >> macros in the Devectorize package, if they work out in this case.
>>> >>
>>> >> -viral
>>> >>
>>> >> On Sunday, September 14, 2014 10:34:45 AM UTC+5:30, Michael Oliver wrote:
>>> >>
>>> >> Hi all,
>>> >> I've implemented a time delay neural network module and have been
>>> >> trying to optimize it. This function propagates the error backwards
>>> >> through the network.
>>> >> deltas.d is just a container for holding the errors, so I can do
>>> >> things in place and don't have to keep initializing arrays. w and d
>>> >> are collections of weights and errors, respectively, for different
>>> >> time lags.
>>> >> This function gets called many, many times, and according to
>>> >> profiling, a lot of garbage collection is induced by the fourth line,
>>> >> specifically within getindex and setindex! in multidimensional.jl and
>>> >> + in array.jl.
>>> >>
>>> >> function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
>>> >>     deltas.d[:] = 0.
>>> >>     for ti=1:size(w,3), ti2=1:size(d,3)
>>> >>         deltas.d[:,:,ti+ti2-1] += w[:,:,ti]'*d[:,:,ti2]
>>> >>     end
>>> >>     deltas.d
>>> >> end
>>> >>
>>> >> Any advice would be much appreciated!
>>> >> Best,
>>> >> Michael
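
[Editor's note] Andreas's Vector{Matrix{Float32}} suggestion is described in the thread but never shown in code. The sketch below is not from the thread: it is one possible reading of that suggestion, written against modern Julia (1.x, where Base.LinAlg became the LinearAlgebra stdlib), and it passes the delta buffers as a plain Vector of matrices rather than the thread's custom `deltas` container, so the argument names are assumed.

```julia
using LinearAlgebra

# Sketch: indexing a Vector{Matrix{Float32}} returns a reference to the
# stored matrix (no copy), so gemm! can accumulate into it in place.
# ws[ti] is m-by-n, ds[ti2] is m-by-p, so dd[ti+ti2-1] must be n-by-p.
function errprop!(ws::Vector{Matrix{Float32}},
                  ds::Vector{Matrix{Float32}},
                  dd::Vector{Matrix{Float32}})
    for m in dd
        fill!(m, 0f0)           # zero the accumulators without allocating
    end
    for ti in eachindex(ws), ti2 in eachindex(ds)
        # dd[ti+ti2-1] += ws[ti]' * ds[ti2], computed in place by BLAS
        BLAS.gemm!('T', 'N', one(Float32), ws[ti], ds[ti2],
                   one(Float32), dd[ti + ti2 - 1])
    end
    return dd
end
```

The trade-off versus the 3-d array is layout: the matrices in a Vector{Matrix} are separate allocations rather than one contiguous block, but no views or subarrays are needed to hand each slice to BLAS.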