On Thursday, March 12, 2015 at 2:14:34 AM UTC-7, Mauro wrote:
>
> Julia is not yet very good at producing fast vectorized code which 
> does not allocate temporaries.  The temporaries are what get you here. 
>
> However, running your example, I get a slightly different 
> *.mem file (which makes more sense to me): 
>
>         - function forward_propagate(nl::NeuralLayer,x::Vector{Float32}) 
>         0   nl.hx = x 
> 248832000   wx = nl.w * nl.hx 
> 348364800   nl.pa = nl.b+wx 
> 1094864752   nl.pr = tanh(nl.pa).*nl.scale 
>         - end 
>
> (What version of Julia are you running? I'm on 0.3.6.)  So every time 
> forward_propagate is called, some temporaries are allocated.  So in 
> performance-critical code you have to write loops instead: 
>
> function forward_propagate(nl::NeuralLayer,x::Vector{Float32}) 
>     nl.hx = x # note: nl.hx now points to the same chunk of memory as x 
>     for i=1:size(nl.w,1) 
>         nl.pa[i] = 0.; 
>         for j=1:size(nl.w,2) 
>             nl.pa[i] += nl.w[i,j]*nl.hx[j] 
>         end 
>         nl.pa[i] += nl.b[i] 
>         nl.pr[i] = tanh(nl.pa[i])*nl.scale[i] 
>     end 
> end 
>
> This does not allocate any memory and runs your test case at about 2x 
> the speed. 
>

Just tried that, and I'm seeing a much bigger improvement: it went from 8 seconds 
to 0.5 seconds per image evaluation. Nice improvement!
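
For reference, another variant I might try is to keep the BLAS matrix-vector 
multiply but do the elementwise part in a loop. Rough sketch only (it assumes 
A_mul_B! is available, as it is on 0.3/0.4, and reuses nl.pa as the output 
buffer; forward_propagate2! is just a name I made up): 

function forward_propagate2!(nl::NeuralLayer, x::Vector{Float32})
    nl.hx = x                      # rebind only, no copy
    A_mul_B!(nl.pa, nl.w, nl.hx)   # in-place nl.w * nl.hx, no temporary vector
    for i in 1:length(nl.pa)
        nl.pa[i] += nl.b[i]
        nl.pr[i]  = tanh(nl.pa[i]) * nl.scale[i]
    end
    return nl.pr
end

I haven't benchmarked this against the fully hand-written loop; for large 
layers the BLAS call should win, for tiny ones the plain loop probably does.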

>
> Also a note on the code in your first email.  Instead of: 
>
>   for y in 1:img.height 
>     @simd for x in 1:img.wid 
>       if 1 < x < img.wid 
>         @inbounds left   = img.data[x-1,y] 
>         @inbounds center = img.data[x,y] 
>         @inbounds right  = img.data[x+1,y] 
>
> you should be able to write: 
>
>   @inbounds for y in 1:img.height 
>     @simd for x in 1:img.wid 
>       if 1 < x < img.wid 
>         left   = img.data[x-1,y] 
>         center = img.data[x,y] 
>         @inbounds right  = img.data[x+1,y] 
>
> Just curious, why did you get rid of the @inbounds on the assignments to 
left and center, but not right?
 

> Also, did you check whether the @simd actually works?  I'm no expert on that, but my 
> understanding is that most of the time it doesn't work with if-else.  If 
> that is the case, maybe special-case the first and last iteration and 
> run the loop like: @simd for x in 2:img.wid-1. 


I just did that and I don't see a huge difference. I'm not sure @simd 
is doing much here; in fact, I took it out and nothing changed. I probably 
have to look at the LLVM IR output to see what's happening.
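
For reference, one way to check is to dump the generated code and look for 
vector instructions (e.g. <4 x float> in the LLVM IR). Something along these 
lines, shown for the devectorized forward_propagate since that's what I have 
handy: 

code_llvm(forward_propagate, (NeuralLayer, Vector{Float32}))    # LLVM IR
code_native(forward_propagate, (NeuralLayer, Vector{Float32}))  # native assembly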

> In fact that would save 
> you a comparison in each iteration irrespective of @simd. 
>

Yes, that's a good point.  I think I'll just pre-load those two columns 
(the 1st and last columns of the matrix).
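
Roughly what I have in mind is the sketch below. Img and the arithmetic in the 
inner loop are stand-ins (I didn't paste the full function), so treat it as a 
shape, not the real code: 

type Img
    wid::Int
    height::Int
    data::Matrix{Float32}
end

function horiz_pass!(out::Matrix{Float32}, img::Img)
    @inbounds for y in 1:img.height
        # interior columns: branch-free body, so @simd at least has a chance
        @simd for x in 2:img.wid-1
            left   = img.data[x-1,y]
            center = img.data[x,y]
            right  = img.data[x+1,y]
            out[x,y] = left + 2f0*center + right   # placeholder computation
        end
        # first and last columns handled once per row, outside the hot loop
        out[1,y]       = img.data[1,y]
        out[img.wid,y] = img.data[img.wid,y]
    end
    return out
end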

>
> On Thu, 2015-03-12 at 02:17, Phil Tomson <philt...@gmail.com> 
> wrote: 
> > I transformed it into a single-file testcase: 
> > 
> > ######################################################### 
> > type NeuralLayer 
> >     w::Matrix{Float32}   # weights 
> >     cm::Matrix{Float32}  # connection matrix 
> >     b::Vector{Float32}   # biases 
> >     scale::Vector{Float32}  # 
> >     a_func::Symbol     # activation function 
> >     hx::Vector{Float32}  # input values 
> >     pa::Vector{Float32}  # pre activation values 
> >     pr::Vector{Float32}  # predictions (activation values) 
> >     frozen::Bool 
> > end 
> > 
> > function forward_propagate(nl::NeuralLayer,x::Vector{Float32}) 
> >   nl.hx = x 
> >   wx = nl.w * nl.hx 
> >   nl.pa = nl.b+wx 
> >   nl.pr = tanh(nl.pa).*nl.scale 
> > end 
> > 
> > out_dim = 10 
> > in_dim = 10 
> > b = sqrt(6) / sqrt(in_dim + out_dim) 
> > 
> > nl = NeuralLayer( 
> >        float32(2.0b * rand(Float32,out_dim,in_dim) - b), # setup rand weights 
> >        ones(Float32,out_dim,in_dim),  # connection matrix 
> >        float32(map(x->x*(randbool()?-1:1),rand(out_dim)*rand(1:4))),  # biases 
> >        rand(Float32,out_dim),  # scale 
> >        :tanh, 
> >        rand(Float32,in_dim), 
> >        rand(Float32,out_dim), 
> >        rand(Float32,out_dim), 
> >        false 
> >     ) 
> > 
> > x = ones(Float32,in_dim) 
> > forward_propagate(nl,x) 
> > clear_malloc_data() 
> > for i in 1:(1920*1080) 
> >   forward_propagate(nl,x) 
> > end 
> > println("nl.pr is: $(nl.pr)") 
> > 
> > ############################################################################# 
> > 
> > Now the interesting part of the  .mem file looks like this: 
> > 
> >        - function forward_propagate(nl::NeuralLayer,x::Vector{Float32}) 
> >         0   nl.hx = x 
> >         0   wx = nl.w * nl.hx 
> >   348368752   nl.pa = nl.b+wx 
> >         0   nl.pr = tanh(nl.pa).*nl.scale 
> >         - end 
> > 
> > I split up the matrix multiply and the addition of the bias vector into two 
> > separate lines, and it looks like it's the vector addition that's allocating 
> > all of the memory (which seems surprising, but maybe I'm missing 
> > something). 
> > 
> > Phil 
>
>
