Would using pmap() be suitable for your application? When I use it, I 
usually get slightly more than a 2x speedup with 4 cores. 
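
For what it's worth, here is a minimal sketch of the pattern I have in mind, not your actual code. It assumes Julia 0.7 or later, where the parallel functions live in the Distributed standard library (on 0.4/0.5 they are in Base). `DATA`, `chunk_ll`, `parallel_ll`, and `ranges` are made-up stand-ins for your obs, cutLL, myLL, and cuts. The data is defined on every process up front, so each pmap call should only need to ship a pair of indices and the parameter vector.

```julia
using Distributed   # Julia >= 0.7; on 0.4/0.5 these functions are in Base

# Start 3 workers only if none are running yet (assumed worker count).
nprocs() == 1 && addprocs(3)

@everywhere begin
    # Hypothetical stand-in for obs: defined once on every process,
    # so it is never sent along with an individual task.
    const DATA = collect(1.0:1000.0)

    # Hypothetical stand-in for cutLL: work on rows i0..i1, inclusive.
    chunk_ll(i0, i1, x) = sum(DATA[i0:i1]) * x[1]
end

# Hypothetical chunk boundaries, playing the role of the rows of cuts.
ranges = [(1, 250), (251, 500), (501, 750), (751, 1000)]

# One pmap task per chunk; the closure captures only the small vector x.
parallel_ll(x, ranges) = sum(pmap(r -> chunk_ll(r[1], r[2], x), ranges))

total = parallel_ll([2.0], ranges)
```

Unlike a parallel reduction over a fixed range, pmap hands tasks to whichever worker is free, so it can help to use more chunks than workers when the chunks take uneven time.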

On Monday, June 20, 2016 at 2:27:41 AM UTC-4, Boylan, Ross wrote:
> I think I've taken steps to minimize parallel overhead by providing only 
> one function call per process and passing really minimal arguments to the 
> functions.  But the gains in speed don't seem commensurate with the number 
> of processors.  I know that pure linear speedup is too much to hope for, 
> but I suspect I'm doing something wrong--for example that large data is 
> getting passed around despite my efforts.
> All my code is defined inside a module, though I exercise it from the main 
> Single processor (times are representative of multiple samples and exclude 
> burn-in runs):
> julia> @time h=Calculus.hessian(RB.mylike, RB.troubleParamsSim1.raw)
>  206.422562 seconds (2.43 G allocations: 83.867 GB, 10.14% gc time)
> #with 3 workers
> julia> myf = RB.makeParallelLL()  # code below
> julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
>  182.567647 seconds (1.48 M allocations: 111.622 MB, 0.02% gc time)
> #with 7 workers
> julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
>   82.453033 seconds (3.43 M allocations: 259.838 MB, 0.08% gc time)
> Any suggestions?  This is on an 8-CPU VM; the underlying hardware has >>8 
> processors.
> Here's some of the code:
> mylike(v) = -logitlite(RB.obs, v, 7)
> """                                                                       
> return a matrix whose columns are the start and the end of range of 
> observations, inclusive.                                                   
> these chunks are roughly equal sizes, and no id is  split between chunks. 
> n  target number of chunks, and thus of rows returned                     
> id id for each row; generally one id may appear in several rows.           
>    all rows with the same id must be contiguous.                           
> """
> function chunks(id, n::Int)
>     nTot = length(id)
>     s = nTot/n
>     cuts = Vector{Int}(n)
>     cuts[1] = 1
>     for i = 1:(n-1)
>         cut::Int = floor(s*i)
>         while id[cut] == id[cut-1]
>             cut += 1
>         end
>         cuts[i+1] = cut
>     end
>     stops = vcat(cuts[2:n]-1, nTot)
>     hcat(cuts, stops)
> end
> """                                                                       
> Parallel helper: evaluate likelihood over limited range                   
> i0..i1, inclusive, is the range.                                           
> This is an effort to avoid passing obs between processes.                 
> """
> function cutLL(i0::Int, i1::Int, x)
>     # obs is a 6413x8 data.frame.  Maybe this is doing an extra copy?
>     logitlite(obs[i0:i1,:], x, 7)
> end
> """                                                                       
> Return a function that takes one argument, the optimization parameters.   
> It evaluates the -LL in parallel                                           
> """
> function makeParallelLL()
>     np = nworkers()
>     cuts = chunks(obs[:id], np)
>     function myLL(x)
>        ll= @parallel (+) for i=1:np
>             cutLL(cuts[i,1], cuts[i,2], x)
>         end
>         -ll
>     end
> end
> It might be relevant that logitlite uses a by() to process groups of 
> observations.
> So I think the only thing I'm passing into the  workers is a function 
> call, cutLL, 2 integers, and a 10-element Vector{Float64}, and the only 
> thing going back is 1 Float64.
> Or perhaps the fact that I'm using an inner function definition (myLL in 
> the last function definition) is causing all variables in lexical scope 
> to be transmitted?

