I can do it using pmap, but it doesn't run any faster (still 3 workers):

julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
186.704029 seconds (410.30 k allocations: 29.503 MB, 0.01% gc time)
That was using this code:

"""
Return a function that takes one argument, the optimization parameters.
It evaluates the -LL in parallel.
"""
function makeParallelLL()
    np = nworkers()
    cuts = chunks(obs[:id], np)
    function myLL(x)
        args = [(cuts[i, 1], cuts[i, 2], x) for i = 1:np]
        ll = pmap(a -> cutLL(a...), args)   # use a different name than x to avoid shadowing
        -sum(ll)
    end
end

________________________________
From: julia-users@googlegroups.com [julia-users@googlegroups.com] on behalf of Christopher Fisher [fishe...@miamioh.edu]
Sent: Monday, June 20, 2016 4:45 AM
To: julia-users
Subject: [julia-users] Re: performance of parallel code

Would using pmap() be suitable for your application? Usually when I use that I get slightly more than a 2X speedup with 4 cores.

On Monday, June 20, 2016 at 2:27:41 AM UTC-4, Boylan, Ross wrote:

I think I've taken steps to minimize parallel overhead by providing only one function call per process and passing really minimal arguments to the functions. But the gains in speed don't seem commensurate with the number of processors. I know that pure linear speedup is too much to hope for, but I suspect I'm doing something wrong -- for example, that large data is getting passed around despite my efforts. All my code is defined inside a module, though I exercise it from the main REPL.

Single processor (times are representative of multiple samples and exclude burn-in runs):

julia> @time h = Calculus.hessian(RB.mylike, RB.troubleParamsSim1.raw)
206.422562 seconds (2.43 G allocations: 83.867 GB, 10.14% gc time)

# with 3 workers
julia> myf = RB.makeParallelLL()  # code below
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
182.567647 seconds (1.48 M allocations: 111.622 MB, 0.02% gc time)

# with 7 workers
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
82.453033 seconds (3.43 M allocations: 259.838 MB, 0.08% gc time)

Any suggestions? This is on an 8-CPU VM; the underlying hardware has >>8 processors.
Here's some of the code:

mylike(v) = -logitlite(RB.obs, v, 7)

"""
Return a matrix whose columns are the start and the end of a range of
observations, inclusive. The chunks are of roughly equal size, and no id
is split between chunks.
n   target number of chunks, and thus of rows returned
id  id for each row; generally one id may appear in several rows. All
    rows with the same id must be contiguous.
"""
function chunks(id, n::Int)
    nTot = length(id)
    s = nTot / n
    cuts = Vector{Int}(n)
    cuts[1] = 1
    for i = 1:(n - 1)
        cut::Int = floor(s * i)
        while id[cut] == id[cut - 1]
            cut += 1
        end
        cuts[i + 1] = cut
    end
    stops = vcat(cuts[2:n] - 1, nTot)
    hcat(cuts, stops)
end

"""
Parallel helper: evaluate the likelihood over a limited range.
i0..i1, inclusive, is the range. This is an effort to avoid passing obs
between processes.
"""
function cutLL(i0::Int, i1::Int, x)
    # obs is a 6413x8 DataFrame. Maybe the slice is doing an extra copy?
    logitlite(obs[i0:i1, :], x, 7)
end

"""
Return a function that takes one argument, the optimization parameters.
It evaluates the -LL in parallel.
"""
function makeParallelLL()
    np = nworkers()
    cuts = chunks(obs[:id], np)
    function myLL(x)
        ll = @parallel (+) for i = 1:np
            cutLL(cuts[i, 1], cuts[i, 2], x)
        end
        -ll
    end
end

It might be relevant that logitlite uses a by() to process groups of observations.

So I think the only things I'm passing into the workers are a function call, cutLL, 2 integers, and a 10-element Vector{Float64}, and the only thing coming back is 1 Float64. Or perhaps the fact that I'm using an inner function definition (myLL in the last function definition) is doing something like causing transmission of all variables in lexical scope?
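For what it's worth, here is a quick sanity check of what chunks produces on a toy id vector (a self-contained copy of the function above, written for current Julia syntax; the toy ids are made up for illustration):

```julia
# Copy of chunks() from the post above, updated for current Julia:
# Vector{Int}(undef, n) instead of Vector{Int}(n), and .- for broadcasting.
function chunks(id, n::Int)
    nTot = length(id)
    s = nTot / n
    cuts = Vector{Int}(undef, n)
    cuts[1] = 1
    for i = 1:(n - 1)
        cut::Int = floor(s * i)
        while id[cut] == id[cut - 1]   # push the boundary forward so an id is never split
            cut += 1
        end
        cuts[i + 1] = cut
    end
    stops = vcat(cuts[2:n] .- 1, nTot)
    hcat(cuts, stops)
end

# Three ids, two contiguous rows each, split into three chunks.
# Each row of the result is a (start, stop) pair, inclusive:
chunks([1, 1, 2, 2, 3, 3], 3)
# -> [1 2; 3 4; 5 6], i.e. no id straddles a chunk boundary
```
Note the while loop assumes the first cut never lands on row 1 (id[cut - 1] would fail for cut == 1), which holds as long as nTot >= n.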