On Thursday, March 12, 2015 at 2:14:34 AM UTC-7, Mauro wrote:
>
> Julia is not yet very good at producing fast vectorized code which
> does not allocate temporaries. The temporaries are what get you here.
>
> However, running your example, I get a slightly different *.mem file
> (which makes more sense to me):
>
>           - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
>           0     nl.hx = x
>   248832000     wx = nl.w * nl.hx
>   348364800     nl.pa = nl.b+wx
>  1094864752     nl.pr = tanh(nl.pa).*nl.scale
>           - end
>
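(For anyone reproducing these numbers: the *.mem files come from running the script with allocation tracking enabled, something like

    julia --track-allocation=user testcase.jl

where testcase.jl stands for the single-file testcase quoted at the bottom of this message; the filename is just for illustration. The clear_malloc_data() call in the testcase resets the counters after a warm-up call, so the reported bytes cover only the loop.)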
I would have guessed it should look more like that; why would the multiplication not result in temporaries (in my case)? That was a bit mysterious.

> (what version of julia are you running, me 0.3.6).

0.3.4 in my case.

> So every time forward_propagate is called, some temporaries are
> allocated. So in performance-critical code you have to write loops
> instead:

Will this always be the case, or is it a current limitation of the Julia compiler? It seems like the more idiomatic, compact code should be handled just as efficiently. Having to break this out into nested for-loops definitely hurts both readability and productivity.

>     function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
>         nl.hx = x  # note: nl.hx now points to the same chunk of memory as x
>         for i = 1:size(nl.w,1)
>             nl.pa[i] = 0f0
>             for j = 1:size(nl.w,2)
>                 nl.pa[i] += nl.w[i,j]*nl.hx[j]
>             end
>             nl.pa[i] += nl.b[i]
>             nl.pr[i] = tanh(nl.pa[i])*nl.scale[i]
>         end
>     end
>
> This does not allocate any memory and runs your test case at about 2x
> the speed.
>
> Also a note on the code in your first email. Instead of:
>
>     for y in 1:img.height
>         @simd for x in 1:img.wid
>             if 1 < x < img.wid
>                 @inbounds left = img.data[x-1,y]
>                 @inbounds center = img.data[x,y]
>                 @inbounds right = img.data[x+1,y]
>
> you should be able to write:
>
>     @inbounds for y in 1:img.height
>         @simd for x in 1:img.wid
>             if 1 < x < img.wid
>                 left = img.data[x-1,y]
>                 center = img.data[x,y]
>                 right = img.data[x+1,y]
>
> Also, did you check that the @simd works? I'm no expert on that, but my
> understanding is that most of the time it doesn't work with if-else. If
> that is the case, maybe special-case the first and last iterations and
> run the loop as @simd for x in 2:img.wid-1. In fact, that would save
> you two comparisons in each iteration irrespective of @simd.
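To make the special-casing concrete, here is a rough sketch of what that restructured loop could look like. Only img.data, img.wid and img.height appear in the snippets above; the 3-tap kernel and the out image below are placeholders, since the real loop body isn't shown in this thread:

    for y in 1:img.height
        @inbounds out.data[1,y]       = img.data[1,y]        # left edge, special-cased
        @inbounds out.data[img.wid,y] = img.data[img.wid,y]  # right edge, special-cased
        @simd for x in 2:img.wid-1                           # branch-free interior
            @inbounds begin
                left   = img.data[x-1,y]
                center = img.data[x,y]
                right  = img.data[x+1,y]
                # placeholder kernel: any branch-free combination works here
                out.data[x,y] = 0.25f0*left + 0.5f0*center + 0.25f0*right
            end
        end
    end

With the if-else gone, @simd gets a straight-line body to vectorize, and the two comparisons per iteration disappear as well.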
> On Thu, 2015-03-12 at 02:17, Phil Tomson <philt...@gmail.com> wrote:
> > I transformed it into a single-file testcase:
> >
> > #########################################################
> > type NeuralLayer
> >     w::Matrix{Float32}      # weights
> >     cm::Matrix{Float32}     # connection matrix
> >     b::Vector{Float32}      # biases
> >     scale::Vector{Float32}  # output scaling, applied after the activation
> >     a_func::Symbol          # activation function
> >     hx::Vector{Float32}     # input values
> >     pa::Vector{Float32}     # pre-activation values
> >     pr::Vector{Float32}     # predictions (activation values)
> >     frozen::Bool
> > end
> >
> > function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> >     nl.hx = x
> >     wx = nl.w * nl.hx
> >     nl.pa = nl.b+wx
> >     nl.pr = tanh(nl.pa).*nl.scale
> > end
> >
> > out_dim = 10
> > in_dim = 10
> > b = sqrt(6) / sqrt(in_dim + out_dim)
> >
> > nl = NeuralLayer(
> >     float32(2.0b * rand(Float32,out_dim,in_dim) - b),  # set up random weights
> >     ones(Float32,out_dim,in_dim),                      # connection matrix
> >     float32(map(x->x*(randbool()?-1:1), rand(out_dim)*rand(1:4))),  # biases
> >     rand(Float32,out_dim),  # scale
> >     :tanh,
> >     rand(Float32,in_dim),
> >     rand(Float32,out_dim),
> >     rand(Float32,out_dim),
> >     false
> > )
> >
> > x = ones(Float32,in_dim)
> > forward_propagate(nl,x)
> > clear_malloc_data()
> > for i in 1:(1920*1080)
> >     forward_propagate(nl,x)
> > end
> > println("nl.pr is: $(nl.pr)")
> > #############################################################################
> >
> > Now the interesting part of the .mem file looks like this:
> >
> >           - function forward_propagate(nl::NeuralLayer,x::Vector{Float32})
> >           0     nl.hx = x
> >           0     wx = nl.w * nl.hx
> >   348368752     nl.pa = nl.b+wx
> >           0     nl.pr = tanh(nl.pa).*nl.scale
> >           - end
> >
> > I split the matrix multiply and the addition of the bias vector into
> > two separate lines, and it looks like it's the vector addition that's
> > allocating all of the memory (which seems surprising, but maybe I'm
> > missing something).
> >
> > Phil
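One more data point: there is a middle ground between the fully vectorized version and the hand-written loops above. A_mul_B!(y, A, x), available in Base in the 0.3 series, computes y = A*x into a preallocated output, so the matrix multiply keeps going through BLAS while the bias add and activation fuse into one small loop. A minimal sketch (the name forward_propagate! is mine; it assumes nl.pa, nl.pr, nl.b and nl.scale all have length out_dim, as they do in the testcase above):

    function forward_propagate!(nl::NeuralLayer, x::Vector{Float32})
        nl.hx = x                     # rebind only; no copy
        A_mul_B!(nl.pa, nl.w, nl.hx)  # nl.pa = nl.w * nl.hx, no temporary
        for i in 1:length(nl.pa)
            nl.pa[i] += nl.b[i]
            nl.pr[i] = tanh(nl.pa[i]) * nl.scale[i]
        end
        nl.pr
    end

That should show zeros on all four lines of the .mem output; whether it beats the fully devectorized version would need measuring.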