Oh never mind - I see that you have a matrix multiply there that benefits from calling BLAS. If it is a matrix multiply, how come you can get away with axpy? Shouldn’t you need a gemm?
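
For reference, an untested sketch of the gemm route (hypothetical code, not from the thread: it assumes the 0.3-era slice() views and that BLAS.gemm! accepts them; with beta = 1, gemm! fuses the transposed multiply and the accumulation in place, so no temporary product is allocated and no indexing copies are made):

function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
    dd = deltas.d
    dd[:] = 0.0f0
    for ti = 1:size(w,3), ti2 = 1:size(d,3)
        # Views, not copies; slice() drops the scalar third dimension.
        A = slice(w, 1:size(w,1), 1:size(w,2), ti)
        B = slice(d, 1:size(d,1), 1:size(d,2), ti2)
        C = slice(dd, 1:size(w,2), 1:size(d,2), ti+ti2-1)
        # C = 1*A'*B + 1*C, accumulated in place by BLAS.
        Base.LinAlg.BLAS.gemm!('T', 'N', 1.0f0, A, B, 1.0f0, C)
    end
    deltas.d
end

The 'T' flag asks BLAS for op(A) = A', matching the w[:,:,ti]' in the original expression.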
Another way to avoid creating temporary arrays with indexing is to use SubArrays, which the linear algebra routines can work with (see the sketch after the quoted thread below).

-viral

> On 14-Sep-2014, at 2:43 pm, Viral Shah <vi...@mayin.org> wrote:
>
> That is great! However, by devectorizing, I meant writing the loop statement itself as two more loops, so that you effectively end up with 3 nested loops. You basically do not want all those w[:,:,ti] calls that create matrices every time.
>
> You could also potentially hoist the deltas.d lookup out of the loop. Try something like:
>
> function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
>     deltas.d[:] = 0.
>     dd = deltas.d
>     for ti=1:size(w,3), ti2=1:size(d,3)
>         for i=1:size(w,1)
>             for j=1:size(w,2)
>                 dd[i,j,ti+ti2-1] += w[i,j,ti]'*d[i,j,ti2]
>             end
>         end
>     end
>     deltas.d
> end
>
> -viral
>
>> On 14-Sep-2014, at 12:47 pm, Michael Oliver <michael.d.oli...@gmail.com> wrote:
>>
>> Thanks Viral for the quick reply, that's good to know. I was able to squeeze a little more performance out with axpy (see below). I tried devectorizing the inner loop, but it was much slower, I believe because it was no longer taking full advantage of MKL for the matrix multiply. So far I've got the code running at 1.4x the speed of what I had in Matlab, and according to @time I still have 44.41% gc time. So 0.4 can't come soon enough! Great work guys, I'm really enjoying learning Julia.
>>
>> function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
>>     deltas.d[:] = 0.
>>     rg = size(w,2)*size(d,2)
>>     for ti=1:size(w,3), ti2=1:size(d,3)
>>         Base.LinAlg.BLAS.axpy!(1, w[:,:,ti]'*d[:,:,ti2], range(1,rg), deltas.d[:,:,ti+ti2-1], range(1,rg))
>>     end
>>     deltas.d
>> end
>>
>> On Saturday, September 13, 2014 10:10:25 PM UTC-7, Viral Shah wrote:
>> The garbage is generated by the indexing operations. In 0.4, we should have array views that solve this problem. For now, you can either manually devectorize the inner loop, or use the @devec macro from the Devectorize package, if it works out in this case.
>>
>> -viral
>>
>> On Sunday, September 14, 2014 10:34:45 AM UTC+5:30, Michael Oliver wrote:
>> Hi all,
>> I've implemented a time-delay neural network module and have now been trying to optimize it. This function propagates the error backwards through the network. deltas.d is just a container for holding the errors, so I can do things in place and don't have to keep initializing arrays. w and d are collections of weights and errors, respectively, for different time lags.
>> This function gets called many, many times, and according to profiling there is a lot of garbage collection induced by the fourth line, specifically within multidimensional.jl's getindex and setindex! and array.jl's +.
>>
>> function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
>>     deltas.d[:] = 0.
>>     for ti=1:size(w,3), ti2=1:size(d,3)
>>         deltas.d[:,:,ti+ti2-1] += w[:,:,ti]'*d[:,:,ti2]
>>     end
>>     deltas.d
>> end
>>
>> Any advice would be much appreciated!
>> Best,
>> Michael
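
And a concrete, untested sketch of the SubArray suggestion at the top of this message, keeping the axpy! formulation (again hypothetical: it assumes the 0.3-era slice() and that BLAS.axpy! accepts a strided view; the important detail is that passing deltas.d[:,:,ti+ti2-1] directly would hand axpy! a copy, since plain indexing copies, so the accumulated values would never land in deltas.d):

function errprop!(w::Array{Float32,3}, d::Array{Float32,3}, deltas)
    dd = deltas.d
    dd[:] = 0.0f0
    for ti = 1:size(w,3), ti2 = 1:size(d,3)
        # A view into the accumulator: axpy! then updates deltas.d itself.
        out = slice(dd, 1:size(w,2), 1:size(d,2), ti+ti2-1)
        Base.LinAlg.BLAS.axpy!(1.0f0, w[:,:,ti]'*d[:,:,ti2], out)
    end
    deltas.d
end

The w[:,:,ti]'*d[:,:,ti2] product still allocates a temporary on every iteration; the gemm! sketch earlier in the thread avoids that as well.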