Would using pmap() be suitable for your application? When I use it I usually see slightly more than a 2x speedup with 4 cores.
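For reference, a minimal sketch of the pmap pattern (the `workChunk` function and the ranges below are hypothetical stand-ins for something like your cutLL; assumes workers were started with e.g. `julia -p 4`):

```julia
# Define the per-chunk worker on all processes before using it.
# This is a placeholder for a real per-chunk likelihood evaluation.
@everywhere function workChunk(r)
    s = 0.0
    for i in r
        s += i^2   # stand-in for the real per-row computation
    end
    s
end

# pmap ships each range to a free worker and collects the results;
# only the small range objects go out and one Float64 per chunk comes back.
ranges = [1:100, 101:200, 201:300, 301:400]
partials = pmap(workChunk, ranges)
total = sum(partials)
```

pmap also load-balances: a worker that finishes its chunk early picks up the next one, which can matter if your chunks take unequal time.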
On Monday, June 20, 2016 at 2:27:41 AM UTC-4, Boylan, Ross wrote:
> I think I've taken steps to minimize parallel overhead by providing only
> one function call per process and passing really minimal arguments to the
> functions. But the gains in speed don't seem commensurate with the number
> of processors. I know that pure linear speedup is too much to hope for,
> but I suspect I'm doing something wrong--for example, that large data is
> getting passed around despite my efforts.
>
> All my code is defined inside a module, though I exercise it from the main
> REPL.
>
> Single processor (times are representative of multiple samples and exclude
> burn-in runs):
> julia> @time h = Calculus.hessian(RB.mylike, RB.troubleParamsSim1.raw)
> 206.422562 seconds (2.43 G allocations: 83.867 GB, 10.14% gc time)
>
> # with 3 workers
> julia> myf = RB.makeParallelLL()  # code below
> julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
> 182.567647 seconds (1.48 M allocations: 111.622 MB, 0.02% gc time)
>
> # with 7 workers
> julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
> 82.453033 seconds (3.43 M allocations: 259.838 MB, 0.08% gc time)
>
> Any suggestions? This is on an 8-CPU VM; the underlying hardware has >8
> processors.
>
> Here's some of the code:
>
> mylike(v) = -logitlite(RB.obs, v, 7)
>
> """
> Return a matrix whose columns are the start and the end of a range of
> observations, inclusive.
> The chunks are of roughly equal size, and no id is split between chunks.
>
> n   target number of chunks, and thus of rows returned
> id  id for each row; generally one id may appear in several rows.
>     All rows with the same id must be contiguous.
> """
> function chunks(id, n::Int)
>     nTot = length(id)
>     s = nTot/n
>     cuts = Vector{Int}(n)
>     cuts[1] = 1
>     for i = 1:(n-1)
>         cut::Int = floor(s*i)
>         while id[cut] == id[cut-1]
>             cut += 1
>         end
>         cuts[i+1] = cut
>     end
>     stops = vcat(cuts[2:n]-1, nTot)
>     hcat(cuts, stops)
> end
>
> """
> Parallel helper: evaluate the likelihood over a limited range.
> i0..i1, inclusive, is the range.
> This is an effort to avoid passing obs between processes.
> """
> function cutLL(i0::Int, i1::Int, x)
>     logitlite(obs[i0:i1, :], x, 7)  # obs is a 6413x8 data.frame.
>                                     # Maybe this is doing an extra copy?
> end
>
> """
> Return a function that takes one argument, the optimization parameters.
> It evaluates the -LL in parallel.
> """
> function makeParallelLL()
>     np = nworkers()
>     cuts = chunks(obs[:id], np)
>     function myLL(x)
>         ll = @parallel (+) for i = 1:np
>             cutLL(cuts[i,1], cuts[i,2], x)
>         end
>         -ll
>     end
> end
>
> It might be relevant that logitlite uses a by() to process groups of
> observations.
>
> So I think the only thing I'm passing to the workers is a function call,
> cutLL, 2 integers, and a 10-element Vector{Float64}, and the only thing
> going back is 1 Float64.
>
> Or perhaps the fact that I'm using an inner function definition (myLL in
> the last function definition) is doing something like causing transmission
> of all variables in lexical scope?
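On the inner-function question: as far as I know, a closure serializes only the variables it actually captures (here np and cuts, which are small), not everything in lexical scope; obs is reached as a global inside cutLL, so each worker uses its own copy from the module. You can sanity-check roughly what gets shipped by serializing the closure yourself. A sketch, with a made-up closure standing in for myLL:

```julia
# A closure that, like myLL, captures only two small values.
function makeLL(cuts)
    np = size(cuts, 1)
    x -> np + sum(cuts) + sum(x)   # captures np and cuts, nothing else
end

f = makeLL([1 2; 3 4])

# Serialize the closure the way remote calls would and look at its size;
# if it were dragging a large data set along, this would be huge.
io = IOBuffer()
serialize(io, f)
println("serialized closure: ", position(io), " bytes")
```

If the byte count stays small, captured variables are not your bottleneck, and I'd look instead at the per-chunk work itself, e.g. whether `obs[i0:i1, :]` is copying on every evaluation.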