I can't pretend to be able to explain the full story, but you may find this section helpful: http://docs.julialang.org/en/release-0.3/manual/performance-tips/#avoid-containers-with-abstract-type-parameters

That whole section on performance tips is worth reading.

On Monday, May 25, 2015 at 9:31:49 AM UTC-4, Dom Luna wrote:

Thanks Mauro! I made the change and it works wonderfully.

I'm still a little confused about why this change in particular makes such a big difference. Is it the use of l1 and l2 for indexing (they still have the constraint of <: Int)? Or is it that v isn't concrete, and that gets propagated to everything else through this line?

    diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)

Lastly, are there tools I could use to check this sort of thing in the future?

On Sun, May 24, 2015 at 2:38 PM, Mauro <maur...@runbox.com> wrote:

The problem is in:

    type Model{T}
        W_main::Matrix{T}
        W_ctx::Matrix{T}
        b_main::Vector{T}
        b_ctx::Vector{T}
        W_main_grad::Matrix{T}
        W_ctx_grad::Matrix{T}
        b_main_grad::Vector{T}
        b_ctx_grad::Vector{T}
        covec::Vector{Cooccurence}  # <- needs to be a concrete type
    end

Instead use

    covec::Vector{Cooccurence{Int,Int,T}}

or some more complicated parameterisation.

Then, when testing timings, you usually do one warm-up run to exclude compilation time:

    GloVe.train!(model, solver)  # warm up
    @time 1                      # @time needs a warm-up too
    @time GloVe.train!(model, solver)

Timings I get:

stock clone from GitHub:

    elapsed time: 0.001617218 seconds (2419024 bytes allocated)

with the improvements mentioned above:

    elapsed time: 0.001344645 seconds (2335552 bytes allocated)

with the improvements mentioned above and your loop version:

    elapsed time: 0.00030488 seconds (3632 bytes allocated)

Hope that helps.
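
To make the difference concrete, here is a minimal sketch of the same pitfall, separate from GloVe.jl (the names Cooc, Bad, Good, and total below are made up for illustration, written in 0.4-dev syntax). With an abstractly typed covec field the compiler cannot infer the element type, so l1, l2, and v all come out as Any: both the indexing and the arithmetic then go through boxing and run-time dispatch, and v's uncertain type propagates into diff, fdiff, and J. As for tools, @code_warntype (available on 0.4-dev) highlights exactly these Any-typed variables:

    # Reduced, hypothetical example; these are not the actual GloVe.jl types.
    immutable Cooc{Ti<:Integer, T}
        i::Ti
        j::Ti
        v::T
    end

    type Bad                 # element type left abstract
        covec::Vector{Cooc}
    end

    type Good{T}             # element type concrete once T is fixed
        covec::Vector{Cooc{Int,T}}
    end

    function total(m)
        s = 0.0
        for c in m.covec
            s += log(c.v)    # with Bad, c.v is inferred as Any: boxing + dynamic dispatch
        end
        s
    end

    covec = Cooc{Int,Float64}[Cooc(rand(1:10), rand(1:10), rand()) for k = 1:10^6]
    bad  = Bad(covec)
    good = Good{Float64}(covec)

    total(bad); total(good)  # warm up so @time excludes compilation
    @time 1                  # @time needs a warm-up too
    @time total(bad)         # allocates on every iteration
    @time total(good)        # essentially allocation-free

    # @code_warntype total(bad)   # shows c.v and s as Any
    # @code_warntype total(good)  # everything inferred as Float64

The same check applied to train! (e.g. @code_warntype GloVe.train!(model, solver)) should show the Any-typed variables disappearing once covec gets a concrete element type.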

On Sun, 2015-05-24 at 19:21, Dominique Luna <dlun...@gmail.com> wrote:

Loop code:

    # TODO: figure out memory issue
    function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
        J = 0.0
        shuffle!(m.covec)
        vecsize = size(m.W_main, 1)
        eltype = typeof(m.b_main[1])
        vm = zeros(eltype, vecsize)
        vc = zeros(eltype, vecsize)
        grad_main = zeros(eltype, vecsize)
        grad_ctx = zeros(eltype, vecsize)
        for n = 1:s.niter
            # shuffle indices
            for i = 1:length(m.covec)
                @inbounds l1 = m.covec[i].i  # main index
                @inbounds l2 = m.covec[i].j  # context index
                @inbounds v = m.covec[i].v
                #= vm[:] = m.W_main[:, l1] =#
                #= vc[:] = m.W_ctx[:, l2] =#
                @inbounds for j = 1:vecsize
                    vm[j] = m.W_main[j, l1]
                    vc[j] = m.W_ctx[j, l2]
                end
                diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
                fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
                J += 0.5 * fdiff * diff
                fdiff *= s.lrate
                # inc memory by ~200 MB && running time by 2x
                #= grad_main[:] = fdiff * m.W_ctx[:, l2] =#
                #= grad_ctx[:] = fdiff * m.W_main[:, l1] =#
                @inbounds for j = 1:vecsize
                    grad_main[j] = fdiff * m.W_ctx[j, l2]
                    grad_ctx[j] = fdiff * m.W_main[j, l1]
                end
                # Adaptive learning
                # inc ~ 600MB + 0.75s
                #= m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1]) =#
                #= m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2]) =#
                #= m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
                #= m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
                @inbounds for j = 1:vecsize
                    m.W_main[j, l1] -= grad_main[j] / sqrt(m.W_main_grad[j, l1])
                    m.W_ctx[j, l2] -= grad_ctx[j] / sqrt(m.W_ctx_grad[j, l2])
                end
                m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
                m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])
                # Gradients
                fdiff *= fdiff
                #= m.W_main_grad[:, l1] += grad_main .^ 2 =#
                #= m.W_ctx_grad[:, l2] += grad_ctx .^ 2 =#
                #= m.b_main_grad[l1] += fdiff =#
                #= m.b_ctx_grad[l2] += fdiff =#
                @inbounds for j = 1:vecsize
                    m.W_main_grad[j, l1] += grad_main[j] ^ 2
                    m.W_ctx_grad[j, l2] += grad_ctx[j] ^ 2
                end
                m.b_main_grad[l1] += fdiff
                m.b_ctx_grad[l2] += fdiff
            end
            #= if n % 10 == 0 =#
            #=     println("iteration $n, cost $J") =#
            #= end =#
        end
    end

And the respective timings:

    @time GloVe.train!(model, GloVe.Adagrad(500))
    7.097 seconds (96237 k allocations: 1468 MB, 7.01% gc time)

Slower and more memory.

On Sun, May 24, 2015 at 4:21 AM, Mauro <maur...@runbox.com> wrote:

Loops should run without allocations. Can you post your loop code?

    A[i, :] = 0.5 * B[i, :]

To state the obvious, as a loop:

    for j = 1:size(A, 2)
        A[i, j] = 0.5 * B[i, j]
    end

This shouldn't allocate, if i is an integer. Unless A and B have different element types; then allocation might happen.
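
For reference, here is a small self-contained way to check the slice-versus-loop allocation point; A, B, and the sizes are arbitrary stand-ins, not the GloVe.jl matrices, and the warm-up-then-@time pattern is the same as above:

    # Hypothetical micro-benchmark: scale one row with a slice copy vs. an explicit loop.
    function slice_version!(A, B, i)
        A[i, :] = 0.5 * B[i, :]        # B[i, :] makes a copy, and 0.5 * (...) makes another
    end

    function loop_version!(A, B, i)
        @inbounds for j = 1:size(A, 2)
            A[i, j] = 0.5 * B[i, j]    # no temporaries
        end
    end

    A = zeros(2000, 2000)
    B = rand(2000, 2000)

    slice_version!(A, B, 1); loop_version!(A, B, 1)  # warm up (compilation)
    @time 1
    @time slice_version!(A, B, 1)  # reports the two temporary rows (~32 KB here)
    @time loop_version!(A, B, 1)   # reports (close to) 0 bytes allocated

If the explicit loop inside train! still allocates, the loop itself is usually not to blame: a type-unstable variable feeding it (here the abstractly typed covec field, which makes l1, l2, and v come out as Any) forces boxing on every iteration, which is what the concrete-type fix above removes.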

On Sun, 2015-05-24 at 05:00, Dom Luna <dlun...@gmail.com> wrote:

Reposting this from the Gitter chat since it seems this is more active.

I'm writing a GloVe module to learn Julia.

How can I avoid memory allocations? My main function deals with a lot of random indexing into matrices:

    A[i, :] = 0.5 * B[i, :]

In this case i isn't from a linear sequence; I'm not sure that matters. Anyway, I've done some analysis and I know B[i, :] is the issue here, since it creates a copy.

https://github.com/JuliaLang/julia/blob/master/base/array.jl#L309 makes the copy.

I tried to do it via a loop, but it looks like that doesn't help either. In fact, it seems to allocate slightly more memory, which seems really odd.

Here's some of the code. It's a little messy since I'm commenting out the different approaches I'm trying.

    type Model{T}
        W_main::Matrix{T}
        W_ctx::Matrix{T}
        b_main::Vector{T}
        b_ctx::Vector{T}
        W_main_grad::Matrix{T}
        W_ctx_grad::Matrix{T}
        b_main_grad::Vector{T}
        b_ctx_grad::Vector{T}
        covec::Vector{Cooccurence}
    end

    # Each vocab word is associated with a main vector and a context vector.
    # The paper initializes these to values in [-0.5, 0.5] / (vecsize + 1) and
    # the gradients to 1.0.
    #
    # The +1 term is for the bias.
    function Model(comatrix; vecsize=100)
        vs = size(comatrix, 1)
        Model(
            (rand(vecsize, vs) - 0.5) / (vecsize + 1),
            (rand(vecsize, vs) - 0.5) / (vecsize + 1),
            (rand(vs) - 0.5) / (vecsize + 1),
            (rand(vs) - 0.5) / (vecsize + 1),
            ones(vecsize, vs),
            ones(vecsize, vs),
            ones(vs),
            ones(vs),
            CoVector(comatrix),  # not required in 0.4
        )
    end

    # TODO: figure out memory issue
    # the memory comments are from a 500-iteration test with vecsize=100
    function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
        J = 0.0
        shuffle!(m.covec)

        vecsize = size(m.W_main, 1)
        eltype = typeof(m.b_main[1])
        vm = zeros(eltype, vecsize)
        vc = zeros(eltype, vecsize)
        grad_main = zeros(eltype, vecsize)
        grad_ctx = zeros(eltype, vecsize)

        for n = 1:s.niter
            # shuffle indices
            for i = 1:length(m.covec)
                @inbounds l1 = m.covec[i].i  # main index
                @inbounds l2 = m.covec[i].j  # context index
                @inbounds v = m.covec[i].v

                vm[:] = m.W_main[:, l1]
                vc[:] = m.W_ctx[:, l2]

                diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
                fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
                J += 0.5 * fdiff * diff

                fdiff *= s.lrate
                # inc memory by ~200 MB && running time by 2x
                grad_main[:] = fdiff * m.W_ctx[:, l2]
                grad_ctx[:] = fdiff * m.W_main[:, l1]

                # Adaptive learning
                # inc ~ 600MB + 0.75s
                #= @inbounds for ii = 1:vecsize =#
                #=     m.W_main[ii, l1] -= grad_main[ii] / sqrt(m.W_main_grad[ii, l1]) =#
                #=     m.W_ctx[ii, l2] -= grad_ctx[ii] / sqrt(m.W_ctx_grad[ii, l2]) =#
                #=     m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
                #=     m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
                #= end =#

                m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1])
                m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2])
                m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
                m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])

                # Gradients
                fdiff *= fdiff
                m.W_main_grad[:, l1] += grad_main .^ 2
                m.W_ctx_grad[:, l2] += grad_ctx .^ 2
                m.b_main_grad[l1] += fdiff
                m.b_ctx_grad[l2] += fdiff
            end

            #= if n % 10 == 0 =#
            #=     println("iteration $n, cost $J") =#
            #= end =#
        end
    end

Here's the entire repo, which might be helpful: https://github.com/domluna/GloVe.jl

I tried doing some loops, but that allocates more memory (oddly enough) and gets slower.

You'll notice the word vectors are indexed by column; I changed the representation to that to see if it would make a difference during the loop. It didn't seem to.

The memory analysis was done on:

    Julia Version 0.4.0-dev+4893
    Commit eb5da26* (2015-05-19 11:51 UTC)
    Platform Info:
      System: Darwin (x86_64-apple-darwin14.4.0)
      CPU: Intel(R) Core(TM) i5-2557M CPU @ 1.70GHz
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
      LAPACK: libopenblas
      LIBM: libopenlibm
      LLVM: libLLVM-3.3

Here the model consists of 100x19 matrices and 100-element vectors: 19 words in the vocab, 100-element word vectors.

    @time GloVe.train!(model, GloVe.Adagrad(500))
    1.990 seconds (6383 k allocations: 1162 MB, 10.82% gc time)

0.3 is a bit slower due to its worse gc, but shows the same memory use.

Any help would be greatly appreciated!

cheers,
dom