I can't pretend to be able to explain the full story, but you may find this section helpful: http://docs.julialang.org/en/release-0.3/manual/performance-tips/#avoid-containers-with-abstract-type-parameters

That whole section on performance tips is worth reading.

On Monday, May 25, 2015 at 9:31:49 AM UTC-4, Dom Luna wrote:

Thanks Mauro! I made the change and it works wonderfully.

I'm still a little confused about why this change in particular makes such a big difference. Is it the use of l1 and l2 for indexing (they still have the constraint of <: Int)? Or is it that v isn't concrete, and that gets propagated to everything else through this line?

    diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)

Lastly, are there tools I could use to check this sort of thing in the future?

On Sun, May 24, 2015 at 2:38 PM, Mauro <maur...@runbox.com> wrote:

The problem is in:

    type Model{T}
        W_main::Matrix{T}
        W_ctx::Matrix{T}
        b_main::Vector{T}
        b_ctx::Vector{T}
        W_main_grad::Matrix{T}
        W_ctx_grad::Matrix{T}
        b_main_grad::Vector{T}
        b_ctx_grad::Vector{T}
        covec::Vector{Cooccurence}  # <- needs to be a concrete type
    end

Instead use

    covec::Vector{Cooccurence{Int,Int,T}}

or some more complicated parameterisation.

Then, when testing timings, you usually do one warm-up run to exclude compilation time:

    GloVe.train!(model, solver)  # warm up
    @time 1                      # @time needs a warm-up too
    @time GloVe.train!(model, solver)

Timings I get:

stock clone from GitHub:

    elapsed time: 0.001617218 seconds (2419024 bytes allocated)

with the improvements mentioned above:

    elapsed time: 0.001344645 seconds (2335552 bytes allocated)

with the improvements mentioned above and your loop version:

    elapsed time: 0.00030488 seconds (3632 bytes allocated)

Hope that helps.
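
To make the difference concrete, here is a minimal sketch of the same pitfall, separate from GloVe.jl (the names Cooc, Bad, Good, and total below are made up for illustration, written in 0.4-dev syntax). With an abstractly typed covec field the compiler cannot infer the element type, so l1, l2, and v all come out as Any: both the indexing and the arithmetic then go through boxing and run-time dispatch, and v's uncertain type propagates into diff, fdiff, and J. As for tools, @code_warntype (available on 0.4-dev) highlights exactly these Any-typed variables:

    # Reduced, hypothetical example; these are not the actual GloVe.jl types.
    immutable Cooc{Ti<:Integer, T}
        i::Ti
        j::Ti
        v::T
    end

    type Bad                 # element type left abstract
        covec::Vector{Cooc}
    end

    type Good{T}             # element type concrete once T is fixed
        covec::Vector{Cooc{Int,T}}
    end

    function total(m)
        s = 0.0
        for c in m.covec
            s += log(c.v)    # with Bad, c.v is inferred as Any: boxing + dynamic dispatch
        end
        s
    end

    covec = Cooc{Int,Float64}[Cooc(rand(1:10), rand(1:10), rand()) for k = 1:10^6]
    bad  = Bad(covec)
    good = Good{Float64}(covec)

    total(bad); total(good)  # warm up so @time excludes compilation
    @time 1                  # @time needs a warm-up too
    @time total(bad)         # allocates on every iteration
    @time total(good)        # essentially allocation-free

    # @code_warntype total(bad)   # shows c.v and s as Any
    # @code_warntype total(good)  # everything inferred as Float64

The same check applied to train! (e.g. @code_warntype GloVe.train!(model, solver)) should show the Any-typed variables disappearing once covec gets a concrete element type.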

On Sun, 2015-05-24 at 19:21, Dominique Luna <dlun...@gmail.com> wrote:

Loop code:

    # TODO: figure out memory issue
    function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
        J = 0.0
        shuffle!(m.covec)
        vecsize = size(m.W_main, 1)
        eltype = typeof(m.b_main[1])
        vm = zeros(eltype, vecsize)
        vc = zeros(eltype, vecsize)
        grad_main = zeros(eltype, vecsize)
        grad_ctx = zeros(eltype, vecsize)
        for n = 1:s.niter
            # shuffle indices
            for i = 1:length(m.covec)
                @inbounds l1 = m.covec[i].i  # main index
                @inbounds l2 = m.covec[i].j  # context index
                @inbounds v = m.covec[i].v
                #= vm[:] = m.W_main[:, l1] =#
                #= vc[:] = m.W_ctx[:, l2] =#
                @inbounds for j = 1:vecsize
                    vm[j] = m.W_main[j, l1]
                    vc[j] = m.W_ctx[j, l2]
                end
                diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
                fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
                J += 0.5 * fdiff * diff
                fdiff *= s.lrate
                # inc memory by ~200 MB && running time by 2x
                #= grad_main[:] = fdiff * m.W_ctx[:, l2] =#
                #= grad_ctx[:] = fdiff * m.W_main[:, l1] =#
                @inbounds for j = 1:vecsize
                    grad_main[j] = fdiff * m.W_ctx[j, l2]
                    grad_ctx[j] = fdiff * m.W_main[j, l1]
                end
                # Adaptive learning
                # inc ~ 600MB + 0.75s
                #= m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1]) =#
                #= m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2]) =#
                #= m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
                #= m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
                @inbounds for j = 1:vecsize
                    m.W_main[j, l1] -= grad_main[j] / sqrt(m.W_main_grad[j, l1])
                    m.W_ctx[j, l2] -= grad_ctx[j] / sqrt(m.W_ctx_grad[j, l2])
                end
                m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
                m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])
                # Gradients
                fdiff *= fdiff
                #= m.W_main_grad[:, l1] += grad_main .^ 2 =#
                #= m.W_ctx_grad[:, l2] += grad_ctx .^ 2 =#
                #= m.b_main_grad[l1] += fdiff =#
                #= m.b_ctx_grad[l2] += fdiff =#
                @inbounds for j = 1:vecsize
                    m.W_main_grad[j, l1] += grad_main[j] ^ 2
                    m.W_ctx_grad[j, l2] += grad_ctx[j] ^ 2
                end
                m.b_main_grad[l1] += fdiff
                m.b_ctx_grad[l2] += fdiff
            end
            #= if n % 10 == 0 =#
            #=     println("iteration $n, cost $J") =#
            #= end =#
        end
    end

And the respective timings:

    @time GloVe.train!(model, GloVe.Adagrad(500))
    7.097 seconds (96237 k allocations: 1468 MB, 7.01% gc time)

Slower and more memory.

On Sun, May 24, 2015 at 4:21 AM, Mauro <maur...@runbox.com> wrote:

Loops should run without allocations. Can you post your loop code?

    A[i, :] = 0.5 * B[i, :]

To state the obvious, as a loop:

    for j = 1:size(A, 2)
        A[i, j] = 0.5 * B[i, j]
    end

This shouldn't allocate, if i is an integer. Unless A and B have different element types; then allocation might happen.
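
For reference, here is a small self-contained way to check the slice-versus-loop allocation point; A, B, and the sizes are arbitrary stand-ins, not the GloVe.jl matrices, and the warm-up-then-@time pattern is the same as above:

    # Hypothetical micro-benchmark: scale one row with a slice copy vs. an explicit loop.
    function slice_version!(A, B, i)
        A[i, :] = 0.5 * B[i, :]        # B[i, :] makes a copy, and 0.5 * (...) makes another
    end

    function loop_version!(A, B, i)
        @inbounds for j = 1:size(A, 2)
            A[i, j] = 0.5 * B[i, j]    # no temporaries
        end
    end

    A = zeros(2000, 2000)
    B = rand(2000, 2000)

    slice_version!(A, B, 1); loop_version!(A, B, 1)  # warm up (compilation)
    @time 1
    @time slice_version!(A, B, 1)  # reports the two temporary rows (~32 KB here)
    @time loop_version!(A, B, 1)   # reports (close to) 0 bytes allocated

If the explicit loop inside train! still allocates, the loop itself is usually not to blame: a type-unstable variable feeding it (here the abstractly typed covec field, which makes l1, l2, and v come out as Any) forces boxing on every iteration, which is what the concrete-type fix above removes.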

On Sun, 2015-05-24 at 05:00, Dom Luna <dlun...@gmail.com> wrote:

Reposting this from the Gitter chat since it seems this is more active.

I'm writing a GloVe module to learn Julia.

How can I avoid memory allocations? My main function deals with a lot of random indexing into matrices:

    A[i, :] = 0.5 * B[i, :]

In this case i isn't from a linear sequence; I'm not sure that matters. Anyway, I've done some analysis and I know B[i, :] is the issue here, since it creates a copy.

https://github.com/JuliaLang/julia/blob/master/base/array.jl#L309 makes the copy.

I tried to do it via a loop, but it looks like that doesn't help either. In fact, it seems to allocate slightly more memory, which seems really odd.

Here's some of the code. It's a little messy since I'm commenting out the different approaches I'm trying.

    type Model{T}
        W_main::Matrix{T}
        W_ctx::Matrix{T}
        b_main::Vector{T}
        b_ctx::Vector{T}
        W_main_grad::Matrix{T}
        W_ctx_grad::Matrix{T}
        b_main_grad::Vector{T}
        b_ctx_grad::Vector{T}
        covec::Vector{Cooccurence}
    end

    # Each vocab word is associated with a main vector and a context vector.
    # The paper initializes these to values in [-0.5, 0.5] / (vecsize + 1) and
    # the gradients to 1.0.
    #
    # The +1 term is for the bias.
    function Model(comatrix; vecsize=100)
        vs = size(comatrix, 1)
        Model(
            (rand(vecsize, vs) - 0.5) / (vecsize + 1),
            (rand(vecsize, vs) - 0.5) / (vecsize + 1),
            (rand(vs) - 0.5) / (vecsize + 1),
            (rand(vs) - 0.5) / (vecsize + 1),
            ones(vecsize, vs),
            ones(vecsize, vs),
            ones(vs),
            ones(vs),
            CoVector(comatrix),  # not required in 0.4
        )
    end

    # TODO: figure out memory issue
    # the memory comments are from a 500-iteration test with vecsize=100
    function train!(m::Model, s::Adagrad; xmax=100, alpha=0.75)
        J = 0.0
        shuffle!(m.covec)

        vecsize = size(m.W_main, 1)
        eltype = typeof(m.b_main[1])
        vm = zeros(eltype, vecsize)
        vc = zeros(eltype, vecsize)
        grad_main = zeros(eltype, vecsize)
        grad_ctx = zeros(eltype, vecsize)

        for n = 1:s.niter
            # shuffle indices
            for i = 1:length(m.covec)
                @inbounds l1 = m.covec[i].i  # main index
                @inbounds l2 = m.covec[i].j  # context index
                @inbounds v = m.covec[i].v

                vm[:] = m.W_main[:, l1]
                vc[:] = m.W_ctx[:, l2]

                diff = dot(vec(vm), vec(vc)) + m.b_main[l1] + m.b_ctx[l2] - log(v)
                fdiff = ifelse(v < xmax, (v / xmax) ^ alpha, 1.0) * diff
                J += 0.5 * fdiff * diff

                fdiff *= s.lrate
                # inc memory by ~200 MB && running time by 2x
                grad_main[:] = fdiff * m.W_ctx[:, l2]
                grad_ctx[:] = fdiff * m.W_main[:, l1]

                # Adaptive learning
                # inc ~ 600MB + 0.75s
                #= @inbounds for ii = 1:vecsize =#
                #=     m.W_main[ii, l1] -= grad_main[ii] / sqrt(m.W_main_grad[ii, l1]) =#
                #=     m.W_ctx[ii, l2] -= grad_ctx[ii] / sqrt(m.W_ctx_grad[ii, l2]) =#
                #=     m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1]) =#
                #=     m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2]) =#
                #= end =#

                m.W_main[:, l1] -= grad_main ./ sqrt(m.W_main_grad[:, l1])
                m.W_ctx[:, l2] -= grad_ctx ./ sqrt(m.W_ctx_grad[:, l2])
                m.b_main[l1] -= fdiff ./ sqrt(m.b_main_grad[l1])
                m.b_ctx[l2] -= fdiff ./ sqrt(m.b_ctx_grad[l2])

                # Gradients
                fdiff *= fdiff
                m.W_main_grad[:, l1] += grad_main .^ 2
                m.W_ctx_grad[:, l2] += grad_ctx .^ 2
                m.b_main_grad[l1] += fdiff
                m.b_ctx_grad[l2] += fdiff
            end

            #= if n % 10 == 0 =#
            #=     println("iteration $n, cost $J") =#
            #= end =#
        end
    end

Here's the entire repo, which might be helpful: https://github.com/domluna/GloVe.jl

I tried doing some loops, but that allocates more memory (oddly enough) and gets slower.

You'll notice the word vectors are indexed by column; I changed the representation to that to see if it would make a difference during the loop. It didn't seem to.

The memory analysis was done on:

    Julia Version 0.4.0-dev+4893
    Commit eb5da26* (2015-05-19 11:51 UTC)
    Platform Info:
      System: Darwin (x86_64-apple-darwin14.4.0)
      CPU: Intel(R) Core(TM) i5-2557M CPU @ 1.70GHz
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
      LAPACK: libopenblas
      LIBM: libopenlibm
      LLVM: libLLVM-3.3

Here the model consists of 100x19 matrices and 100-element vectors: 19 words in the vocab, 100-element word vectors.

    @time GloVe.train!(model, GloVe.Adagrad(500))
    1.990 seconds (6383 k allocations: 1162 MB, 10.82% gc time)

0.3 is a bit slower due to its worse gc, but shows the same memory use.

Any help would be greatly appreciated!

cheers,
dom