Mark -
I did not realize you were talking about FPI, rather than MPI.  Your multi 
millisecond latencies are right for disk i/o , but  I was referring to direct 
memory to memory message passing, which is orders of magnitude  faster than 
going through a disk.  Why would anyone use FPI if MPI on an SMP were available?
(At least for parallelizing a single job - I am not talking about 'embarrassingly 
parallel' tasks such as bootstrap, where MPI, as several in this thread have 
correctly remarked, should normally not be used since it just introduces extra 
overhead.)

From: Mark Sale
Sent: Wednesday, December 09, 2015 1:56 PM
To: Bob Leary; Faelens, Ruben (Belgium); Pavel Belo;
Subject: Re: [NMusers] setup of parallel processing and supporting software - help wanted 


For what it is worth, the 20 msec was when we first started developing this, 
which has been about 10 years now.  Disc seek time is currently about 8 msec 
and latency about 4 msec, so we thought 20 was reasonable.

That is not the I/O time for a "single sweep through the data". This is not 
data (i.e., FDATA) reading: each process gets its own copy of the data, and MPI 
does not send data (i.e., FDATA) back and forth. The I/O time is for:

1. Manager to write the required information (THETA and a whole lot of other 

2. worker to read that information

3. worker to write the information back

4. Manager to read the required information.

In between steps 2 and 3 is the "calculation" part.
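
As a rough illustration of the four steps above, here is a minimal Python sketch of one manager/worker exchange through files. It is not NONMEM's FPI code: the file names, the JSON format, and the stand-in "calculation" are all assumptions made for the example, and both roles run in one process for simplicity.

```python
import json
import os
import tempfile
import time

def file_round_trip(theta, workdir):
    """One manager/worker exchange through files (steps 1-4 above)."""
    to_worker = os.path.join(workdir, "to_worker.json")    # hypothetical file name
    to_manager = os.path.join(workdir, "to_manager.json")  # hypothetical file name

    # 1. Manager writes the required information.
    with open(to_worker, "w") as f:
        json.dump({"theta": theta}, f)

    # 2. Worker reads that information.
    with open(to_worker) as f:
        received = json.load(f)["theta"]

    # "Calculation" part: a stand-in computation, not a real objective function.
    partial_obj = sum(t * t for t in received)

    # 3. Worker writes the information back.
    with open(to_manager, "w") as f:
        json.dump({"obj": partial_obj}, f)

    # 4. Manager reads the required information.
    with open(to_manager) as f:
        return json.load(f)["obj"]

with tempfile.TemporaryDirectory() as workdir:
    start = time.perf_counter()
    obj = file_round_trip([1.0, 2.0, 3.0], workdir)
    elapsed_ms = (time.perf_counter() - start) * 1000.0

print(f"partial OBJ = {obj}, round trip took {elapsed_ms:.3f} ms")
```

The measured time depends heavily on the operating system's file cache, so it should not be read as a benchmark of the 20 msec figure.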

This is disc read/write, so MPI should be much better at it than FPI, since it 
doesn't have to write all of this to disc (and I assume it very rarely does; I 
believe that MPI is very good at doing all of this in memory).

You're right, there are other points at which non linear regression can be 
parallelized, although NONMEM only does it at the function evaluation level.

WRT any model running faster in parallel than on a single processor: at least 
with NONMEM, that is not my experience. Again, the threshold for meaningful 
gain is a function evaluation time of 500 msec in my experience, but I haven't 
benchmarked it recently, and it may be less now. I suspect you still won't get 
a 2 minute, 4000 function evaluation run down to 30 seconds on 4 cores, but I 
would look forward to learning about other people's experience.



From: Bob Leary
Sent: Wednesday, December 9, 2015 2:04 PM
To: Mark Sale; Faelens, Ruben (Belgium); Pavel Belo;
Subject: RE: [NMusers] setup of parallel processing and supporting software - help wanted 


a) I have to disagree with you that the efficiency of MPI implementation does 
not depend on the size of the
data set for a single desktop SMP machine with multiple processors - larger 
data sets mean higher granularity and more cpu-bound work between stoppages for 
This assumes the NLME MPI implementation is done efficiently  - I don't know 
the details of the NONMEM MPI implementation, particularly those of how 
communications are handled.

b) your I/O timings seem horrendously large (if by msec you mean milliseconds)

 I/O times of 40 milliseconds per function evaluation (assuming 1 function 
evaluation is a single sweep
 through  all Nsub subjects, evaluating and summing the likelihood contribution 
from each subject) seem very high.   I have been
running MPI since its original release in 1994 (I was a member of the committee 
that designed the first release of MPI during 1992-1994 ) -
these communications timings would seem more appropriate for machines from that 

I/O timings for MPI are usually modeled by a latency (startup time - typically 
on the order of 1 microsecond on current single desktop SMP machines) and a 
bandwidth (on the order of tens of gigabytes/sec for current-era SMPs, but 
much lower for clusters).
Based on the latency/bandwidth model, the conventional wisdom is to manage the 
message processing so as to
favor a few large messages as opposed to many small messages to minimize the 
latency contribution.
If possible, small messages should be concatenated into  larger messages.   I 
don't know the details of the MPI implementation in NONMEM, but for FOCE-like
NLME algorithms, it is possible to limit the number of messages to just a few 
per function evaluation.
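
The latency/bandwidth model above can be sketched in a few lines of Python. The default latency (1 microsecond) and bandwidth (10 gigabytes/sec) are the orders of magnitude quoted in the text; the function name, the payload size, and the message counts are assumptions chosen only for illustration.

```python
def transfer_time(n_messages, total_bytes,
                  latency_s=1e-6, bandwidth_bytes_per_s=10e9):
    """Latency/bandwidth model: every message pays the startup latency once,
    and the total payload moves at the given bandwidth."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s

payload = 8 * 1000  # e.g. 1000 double-precision values (illustrative)

many_small = transfer_time(1000, payload)  # one value per message
one_large = transfer_time(1, payload)      # everything concatenated

print(f"1000 small messages: {many_small * 1e6:.1f} microseconds")
print(f"1 large message:     {one_large * 1e6:.1f} microseconds")
```

With these assumed numbers the latency term dominates the many-small-messages case, which is why concatenating small messages helps.
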
 If the data set size is expanded by adding more subjects, then more work (more 
subjects processed)
will be done between stoppages for communication at the function evaluation 

In the MPI implementation for Phoenix NLME, I find it almost impossible to  
find a  model where I/O dominates to such an
extent that the MPI version runs slower than the single processor version on a 
4-processor Intel i7 desktop.  For example,
I just tested (FOCE) the classic simple closed form Emax model used in the 
INSERM estimation method comparison exercise from 2004
 (Girard and Mentre', PAGE 2005, abstract 234)  with Phoenix NLME.   It would 
be hard to find a simpler model -
E=E0 + EMAX*DOSE/(ED50 + DOSE) +EPS, with random effects on each of the  three 
parameters E0, EMAX, and ED50,
and three observations per subject.  If I expand the data set to around 1600 
from the original 100 subjects
and run on a four processor i7, the internally reported cpu time is 18 sec for 
four processors vs 72 sec for one processor (a speedup of 4).
  Wall clock times were a few seconds longer for each run.  If I make the data 
set smaller, down to the original size of 100,
 the speedup clearly suffers a decrease but I still observe a reported cpu time 
speedup of 2.5x for the four processors (times are
well under 1 sec, so reliable wall clock times are not available).
(this was done on a relatively old i7 desktop, so more current machines may do  

c) It is not always necessary to parallelize over function evaluations (i.e. 
over subjects).  In importance sampling EM methods, (IMP in NONMEM,
QRPEM in Phoenix NLME), in principle the parallelization can be done over the 
sample points used in the Monte Carlo or quasi-Monte Carlo integral evaluations 
- there are usually many more of these than processors available.  In PHX QRPEM,
we actually do it this way and it works fine.  Now all processors are working 
on the same subject at the same time, so
load balancing problems tend to go away, but communications overhead increases 
since now you have to pass separate messages for
each  subject, whereas in FOCE-like algorithms you only have to pass messages 
at the end of a sweep through all the subjects.
One thing we have noticed is that QRPEM parallelized this way is much more 
reproducible - single processor results almost always
match multiprocessor results exactly, which is not always the case with some of 
the FOCE-like methods.
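
The difference in message traffic described in (c) can be illustrated with a simple count. This is only a sketch under a stated assumption (each worker returns exactly one message per synchronization point); the function name and the numbers are hypothetical and are not taken from the Phoenix NLME or NONMEM code.

```python
def messages_per_sweep(n_subjects, n_workers, over_sample_points):
    """Count result messages returned to the manager in one sweep through the data.

    Assumption: each worker sends exactly one message per synchronization point.
    """
    if over_sample_points:
        # All workers share each subject's sample points, so they
        # synchronize once per subject.
        return n_subjects * n_workers
    # Subjects are divided among the workers, which report once
    # at the end of the sweep.
    return n_workers

foce_like = messages_per_sweep(100, 4, over_sample_points=False)
sample_parallel = messages_per_sweep(100, 4, over_sample_points=True)

print(f"split over subjects:      {foce_like} messages per sweep")
print(f"split over sample points: {sample_parallel} messages per sweep")
```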

Bob Leary
Fellow, Pharsight Corporation

From: Mark Sale 
Sent: Wednesday, December 09, 2015 7:42 AM
To: Faelens, Ruben (Belgium); Pavel Belo;
Subject: Re: [NMusers] setup of parallel processing and supporting software - help wanted 

Maybe a little more clarification:

Thanks to Bob for pointing out that the


option implements some code for load balancing, and there really is no 
downside, so should probably always be used.

Contrary to other comments, NONMEM 7.3 (and 7.2) does parallelize the 
covariance step.  Ruben is correct that the $TABLE step is not parallelized in 

WRT sometimes it works and sometimes it doesn't, we can be more specific than 
this. The parallelization takes place at the level of the calculation of the 
objective function.  The data are split up and the OBJ for the subsets of the 
data is sent to multiple processes.  When all processes are done, the results 
are compiled by the manager program.   The total round trip time for one 
process then is the calculation time + I/O time.  Without parallelization, 
there is no I/O time.  For each parallel process, the I/O time is essentially 
fixed (in our benchmarks maybe 20-40 msec per process on a single machine). The 
variable of interest then is the calculation time.   If the calculation time is 
1 msec and the I/O time is 20 msec, and you parallelize to 2 cores, you cut the 
calculation time to 0.5 msec but now have 40 msec (2*20 msec) of I/O time, for a 
total of 40.5 msec, much slower.  If the calculation time is 500 msec, and you 
parallelize to 2 cores, the total time is 250 msec (for calculation) + 2*20 
msec (for I/O) = 290 msec.  The key parameter then is the time for a single 
objective function evaluation (not the total run time).  If the time for a 
single function evaluation is > 500 msec, parallelization will be helpful (on a 
single machine).  There really isn't anything very mystical about when it helps 
and when it doesn't. The efficiency depends very little on the size of the data 
set, except that the limit of parallelization is the number of subjects (the 
data set must be split up by subject).
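
The arithmetic in the paragraph above follows a simple model: total time per function evaluation = calculation time / number of processes + number of processes * I/O time per process. A minimal Python sketch, assuming that model holds exactly (the function name is hypothetical, and real timings will also depend on load balance and hardware):

```python
def parallel_time_ms(calc_ms, io_ms, n_procs):
    """Time for one objective function evaluation under the simple model:
    the calculation is divided evenly, and each process adds a fixed I/O cost."""
    return calc_ms / n_procs + n_procs * io_ms

# The two cases worked through in the text (2 cores, 20 msec I/O per process).
fast_model = parallel_time_ms(calc_ms=1, io_ms=20, n_procs=2)
slow_model = parallel_time_ms(calc_ms=500, io_ms=20, n_procs=2)

print(f"1 msec evaluation:   {fast_model} msec in parallel vs 1 msec single")
print(f"500 msec evaluation: {slow_model} msec in parallel vs 500 msec single")
```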

