Hi Mark,

Your analyses are quite reasonable. The low-temperature replicas are indeed doing much more work than the high-temperature replicas. As you said, the lowest-temperature replica in the 24-replica run should take roughly as long as the lowest one in the 42-replica run, so in my case the load imbalance across replicas is only partly to blame. I can now rule out the REMD parameters themselves and will ask the system administrators about other possible explanations.

May I ask: when you run REMD in the NVT ensemble, do all your replicas use the same volume as the lowest-temperature replica? Or do you equilibrate each replica first in NPT, then in NVT, and then feed the equilibrated structures into the NVT REMD simulation?

Thank you for all your helpful suggestions!

Qiong


>> Hi Mark,
>>
>> Many thanks for your fast response!
>>
>>> What's the network hardware? Can other machine load influence your network performance?
>>
>> The supercomputer system is based on the Cray Gemini interconnect technology, so I suppose this counts as fast network hardware...
>>
>>> Are the systems in the NVT ensemble? Use diff to check the .mdp files differ only how you think they do.
>>
>> The systems are in the NPT ensemble. I saw some discussions on the mailing list that the NPT ensemble is superior to the NVT ensemble for REMD. The .mdp files differ only in the temperature.
>
> Maybe so, but under NPT the density varies with T, and so with replica. This means the size of the neighbour lists varies, and the cost of the computation (PME or not) varies. The generalized ensemble is limited by the progress of the slowest replica. If you are using PME, in theory you can juggle the contributions of the various terms to balance the computational load across the replicas, but this is not easy to do.
>
>>> What are the values of nstlist and nstcalcenergy?
>>
>> Previously, nstlist=5 and nstcalcenergy=1. Thank you for pointing this out. I checked the manual again: this option affects performance in parallel simulations because calculating energies requires global communication between all processes, so I have set it to -1 this time. This should be one reason for the low parallel efficiency. After changing to nstcalcenergy=-1, I saw about a 3% improvement in efficiency compared with nstcalcenergy=1.
>
> Yep. nstpcouple and nsttcouple also influence this.
>
>>> Take a look at the execution time breakdown at the end of the .log files, and do so for more than one replica. With the current implementation, every simulation has to synchronize and communicate every handful of steps, which means that large-scale parallelism won't work efficiently unless you have fast network hardware that is dedicated to your job. This effect shows up in the "Rest" row of the time breakdown. With Infiniband, I'd expect you should only be losing about 10% of the total run time. The 30-fold loss you have upon going from 24 to 42 replicas while keeping 4 CPUs per replica suggests some other contribution, however.
>>
>> I checked the time breakdown in the log files for short REMD simulations. For the REMD simulation with 168 cores and 42 replicas, as you can see below, "Rest" makes up a surprisingly high 96.6% of the time for one of the replicas, and it is at almost the same level for the other replicas. For the REMD simulation with 96 cores and 24 replicas, "Rest" takes up about 24%. I was also aware of your post http://www.mail-archive.com/gmx-users@gromacs.org/msg37507.html - as you suggested there, such a big loss should be ascribed to other factors.
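
Something like the following should pull the Force, PME mesh and Rest rows out of every replica's log for a side-by-side comparison (assuming the per-replica logs are named remd0.log through remd41.log; those names are just an example, adjust them to your own setup):

  # print the Force, PME mesh and Rest rows from each replica's time breakdown
  for log in remd{0..41}.log; do
      echo "== $log =="
      grep -E '^[[:space:]]*(Force|PME mesh|Rest) ' "$log"
  done
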
>>
>> Do you think the network hardware is to blame, or are there other reasons? Any suggestion would be greatly appreciated.
>
> I expect the load imbalance across replicas is partly to blame. Look at the sum of Force + PME mesh (in seconds) across the generalized ensemble. That's where all the simulation work is done, and I expect your low-temperature replicas are doing much more work than your high-temperature replicas. Unfortunately 4.5.3 doesn't allow the user to know enough detail here. Future versions of GROMACS will - work in progress.
>
> Strictly, though, your rate-limiting lowest-temperature replica in the 24-replica regime should take an amount of time comparable to that of the lowest in the 42-replica regime (a 22 K difference is not that significant) - and similar to a run outside of a replica-exchange simulation. Your reported data are not consistent with that, so I think your jobs are also experiencing differing degrees of network or filesystem contention at different times. Your sysadmins can comment on that.
>
> Mark
>
>> Computing:       Nodes   Number    G-Cycles   Seconds      %
>> ------------------------------------------------------------
>> Domain decomp.       4      442       2.604       1.2    0.0
>> DD comm. load        4        6       0.001       0.0    0.0
>> Comm. coord.         4     2201       1.145       0.5    0.0
>> Neighbor search      4      442      14.964       7.1    0.2
>> Force                4     2201     175.303      83.5    2.0
>> Wait + Comm. F       4     2201       1.245       0.6    0.0
>> PME mesh             4     2201      30.314      14.4    0.3
>> Write traj.          4       11      17.346       8.3    0.2
>> Update               4     2201       2.004       1.0    0.0
>> Constraints          4     2201      26.593      12.7    0.3
>> Comm. energies       4      442      28.722      13.7    0.3
>> Rest                 4              8426.029    4012.4   96.6
>> ------------------------------------------------------------
>> Total                4              8726.270    4155.4  100.0
>>
>> Qiong
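
P.S. On the earlier point about using diff to check the .mdp files: a quick pairwise check against the first replica's input, assuming the files are named remd0.mdp through remd41.mdp (example names only, adjust to the actual ones), could look something like this:

  # compare every replica's .mdp file against the first one
  for i in $(seq 1 41); do
      echo "== remd0.mdp vs remd$i.mdp =="
      diff remd0.mdp remd$i.mdp
  done

If the files really differ only in the temperature, each diff should show just the temperature-related lines (ref_t, and gen_temp if velocities are generated).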