Mark,

a) I have to disagree with you that the efficiency of an MPI implementation does not depend on the size of the data set on a single desktop SMP machine with multiple processors. Larger data sets mean coarser granularity, i.e. more CPU-bound work between stoppages for communication. This assumes the NLME MPI implementation is done efficiently. I don't know the details of the NONMEM MPI implementation, particularly how communications are handled.

b) Your I/O timings seem horrendously large (if by msec you mean milliseconds). I/O times of 40 milliseconds per function evaluation (assuming one function evaluation is a single sweep through all Nsub subjects, evaluating and summing the likelihood contribution from each subject) seem very high. I have been running MPI since its original release in 1994 (I was a member of the committee that designed the first release of MPI during 1992-1994), and these communication timings would seem more appropriate for machines from that era.

I/O timings for MPI are usually modeled by a latency (startup time, typically on the order of 1 microsecond on current single desktop SMP machines) and a bandwidth (on the order of tens of gigabytes/sec for current-era SMPs, but much lower for clusters). Based on the latency/bandwidth model, the conventional wisdom is to manage the message processing so as to favor a few large messages over many small messages, to minimize the latency contribution. If possible, small messages should be concatenated into larger messages. I don't know the details of the MPI implementation in NONMEM, but for FOCE-like NLME algorithms it is possible to limit the number of messages to just a few per function evaluation. If the data set is expanded by adding more subjects, then more work (more subjects processed) will be done between stoppages for communication at the function evaluation boundaries.

In the MPI implementation for Phoenix NLME, I find it almost impossible to find a model where I/O dominates to such an extent that the MPI version runs slower than the single processor version on a 4-processor Intel i7 desktop. For example, I just tested (FOCE) the classic simple closed form Emax model used in the INSERM estimation method comparison exercise from 2004 (Girard and Mentré, PAGE 2005, abstract 234) with Phoenix NLME.
It would be hard to find a simpler model: E = E0 + EMAX*DOSE/(ED50 + DOSE) + EPS, with random effects on each of the three parameters E0, EMAX, and ED50, and three observations per subject. If I expand the data set from the original 100 subjects to around 1600 and run on a four-processor i7, the internally reported CPU time is 18 sec for four processors vs 72 sec for one processor (a speedup of 4). Wall clock times were a few seconds longer for each run. If I shrink the data set back to the original 100 subjects, the speedup clearly suffers, but I still observe a reported CPU time speedup of 2.5x for the four processors (times are well under 1 sec, so reliable wall clock times are not available). This was done on a relatively old i7 desktop, so more current machines may do better.

c) It is not always necessary to parallelize over function evaluations (i.e. over subjects). In importance sampling EM methods (IMP in NONMEM, QRPEM in Phoenix NLME), the parallelization can in principle be done over the sample points used in the Monte Carlo or quasi-Monte Carlo integral evaluations; there are usually many more of these than processors available. In PHX QRPEM, we actually do it this way and it works fine. Now all processors are working on the same subject at the same time, so load balancing problems tend to go away, but communications overhead increases, since you now have to pass separate messages for each subject, whereas in FOCE-like algorithms you only have to pass messages at the end of a sweep through all the subjects. One thing we have noticed is that QRPEM parallelized this way is much more reproducible: single processor results almost always match multiprocessor results exactly, which is not always the case with some of the FOCE-like methods.
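
As a rough illustration of the latency/bandwidth model in b), here is a minimal Python sketch. The function names, the 8-byte message size, and the default values (1 microsecond latency, 10 gigabytes/sec bandwidth) are assumptions chosen only to match the orders of magnitude quoted above; they are not measurements and say nothing about how NONMEM or Phoenix NLME actually pack their messages.

```python
def message_time_s(n_bytes, latency_s=1e-6, bandwidth_bytes_per_s=10e9):
    """Time to send one message: fixed startup latency plus transfer time."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def total_time_s(n_messages, bytes_per_message, **model):
    """Time to send n_messages messages of equal size, one after another."""
    return n_messages * message_time_s(bytes_per_message, **model)


# Hypothetical payload: 1000 double-precision values (8 bytes each).
many_small = total_time_s(1000, 8)      # one value per message
one_large = total_time_s(1, 1000 * 8)   # all values concatenated into one message

print(f"1000 small messages: {many_small:.3e} s")
print(f"1 large message:     {one_large:.3e} s")
```

Under these assumed numbers the cost of the small messages is almost entirely latency, which is why concatenating them into one large message pays off.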

Bob Leary
Fellow, Pharsight Corporation

________________________________
From: owner-nmus...@globomaxnm.com [owner-nmus...@globomaxnm.com] on behalf of Mark Sale [ms...@nuventra.com]
Sent: Wednesday, December 09, 2015 7:42 AM
To: Faelens, Ruben (Belgium); Pavel Belo; nmusers@globomaxnm.com
Subject: Re: [NMusers] setup of parallel processing and supporting software - help wanted

Maybe a little more clarification:

Thanks to Bob for pointing out that the PARSE_TYPE=2 or 4 option implements some code for load balancing, and there really is no downside, so it should probably always be used.

Contrary to other comments, NONMEM 7.3 (and 7.2) does parallelize the covariance step. Ruben is correct that the $TABLE step is not parallelized in 7.3.

With respect to "sometimes it works and sometimes it doesn't", we can be more specific than this. The parallelization takes place at the level of the calculation of the objective function. The data are split up and the OBJ calculation for each subset of the data is sent to a separate process. When all processes are done, the results are compiled by the manager program. The total round trip time for one process is then the calculation time + I/O time. Without parallelization, there is no I/O time. For each parallel process, the I/O time is essentially fixed (in our benchmarks maybe 20-40 msec per process on a single machine). The variable of interest then is the calculation time. If the calculation time is 1 msec and the I/O time is 20 msec, and you parallelize to 2 cores, you cut the calculation time to 0.5 msec but now have 40 msec (2*20 msec) of I/O time, for a total of 40.5 msec, much slower. If the calculation time is 500 msec and you parallelize to 2 cores, the total time is 250 msec (for calculation) + 2*20 msec (for I/O) = 290 msec. The key parameter then is the time for a single objective function evaluation (not the total run time). If the time for a single function evaluation is > 500 msec, parallelization will be helpful (on a single machine).
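
The arithmetic above can be written out as a small Python sketch. The function name is made up for illustration, and the model (calculation time divided evenly across processes, plus a fixed I/O cost per process) is only the simplification described in this message, not a description of NONMEM internals.

```python
def round_trip_ms(calc_ms, io_ms_per_process, n_processes):
    """Time for one objective function evaluation under the simple model:
    calculation splits evenly, each parallel process adds a fixed I/O cost."""
    if n_processes == 1:
        return calc_ms  # without parallelization there is no I/O time
    return calc_ms / n_processes + n_processes * io_ms_per_process


# The two cases from the text, with 20 msec of I/O per process.
for calc_ms in (1.0, 500.0):
    serial = round_trip_ms(calc_ms, 20.0, 1)
    two_cores = round_trip_ms(calc_ms, 20.0, 2)
    print(f"calc {calc_ms} msec: 1 core = {serial} msec, 2 cores = {two_cores} msec")
```

With these inputs the sketch reproduces the 40.5 msec and 290 msec totals quoted above.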

There really isn't anything very mystical about when it helps and when it doesn't. The efficiency depends very little on the size of the data set, except that the limit of parallelization is the number of subjects (the data set must be split up by subject).

Mark Sale M.D.
Vice President, Modeling and Simulation
Nuventra, Inc. ™
2525 Meridian Parkway, Suite 280
Research Triangle Park, NC 27713
Office (919)-973-0383
ms...@nuventra.com
www.nuventra.com