On Fri, 2011-01-28 at 16:46 +1100, Mark Abraham wrote:
> Hi,
>
> I compared the .log file time accounting for the same .tpr file run
> alone in serial or as part of an REMD simulation (with each replica on
> a single processor). It ran about 5-10% slower in the latter. The
> effect was a bit larger when comparing the same .tpr on 8 processors
> with REMD with 8 processors per replica. The effect seems fairly
> independent of whether I compare the lowest or highest replica.
>
> The system is 1ns of Ace-(Ala)_10-NME in CHARMM27 with GROMACS 4.5.3
> using NVT, PME, virtual sites, 4fs timesteps, rlist=rvdw=rcoulomb=1.0nm
> with REMD ranging over 20 replicas distributed exponentially from 298K
> to 431.57K using v-rescale T-coupling. The machine has two quad-core
> processors per node with an InfiniBand connection. The InfiniBand
> switch is shared with other users' calculations, so some load-based
> variability can and does occur, but this should have shown up in a
> named part of the time accounting.
>
> My first thought was that REMD exchange latency was to blame, so I
> quickly hacked in a change to report the length of time spent in the
> REMD initialization routine, and then each call to the REMD
> exchange-attempt routine.
>
> Comparing the performance between REMD and serial of the lowest replica
> on a single processor, I saw with diff:
>
>   Computing:       Nodes   Number   G-Cycles  Seconds      %
> 7394,7403c6910,6918
> < Vsite constr.        1   250001     40.271     13.8    0.7
> < Neighbor search      1    25011    434.982    148.7    7.1
> < Force                1   250001   3607.375   1232.8   59.1
> < PME mesh             1   250001   1270.407    434.1   20.8
> < Vsite spread         1   500002     41.671     14.2    0.7
> < Write traj.          1        3      7.873      2.7    0.1
> < Update               1   250001     82.822     28.3    1.4
> < Constraints          1   250001    154.231     52.7    2.5
> < REMD                 1      100     59.070     20.2    1.0
> < Rest                 1             409.862    140.1    6.7
> ---
> > Vsite constr.        1   250001     40.526     13.8    0.7
> > Neighbor search      1    25001    434.871    148.6    7.5
> > Force                1   250001   3601.463   1230.8   62.2
> > PME mesh             1   250001   1292.675    441.8   22.3
> > Vsite spread         1   500002     41.479     14.2    0.7
> > Write traj.          1        3     17.153      5.9    0.3
> > Update               1   250001     82.114     28.1    1.4
> > Constraints          1   250001    154.426     52.8    2.7
> > Rest                 1             122.023     41.7    2.1
> 7405c6920
> < Total                1            6108.562   2087.5  100.0
> ---
> > Total                1            5786.731   1977.5  100.0
>
> So "Rest" goes up from 122 to 409 G-cycles (41.7 s to 140.1 s) under
> REMD, even after factoring out the 59 G-cycles (20.2 s) actually spent
> in REMD. With the highest replica:
>
>   Computing:       Nodes   Number   G-Cycles  Seconds      %
> 7394,7403c6910,6918
> < Vsite constr.        1   250001     40.261     13.8    0.7
> < Neighbor search      1    25016    434.878    148.6    7.1
> < Force                1   250001   3606.913   1232.6   59.0
> < PME mesh             1   250001   1264.716    432.2   20.7
> < Vsite spread         1   500002     41.268     14.1    0.7
> < Write traj.          1        3      7.113      2.4    0.1
> < Update               1   250001     82.491     28.2    1.4
> < Constraints          1   250001    153.207     52.4    2.5
> < REMD                 1      100     60.272     20.6    1.0
> < Rest                 1             417.399    142.6    6.8
> ---
> > Vsite constr.        1   250001     40.518     13.8    0.7
> > Neighbor search      1    25001    435.069    148.7    7.6
> > Force                1   250001   3609.196   1233.4   62.6
> > PME mesh             1   250001   1283.082    438.5   22.3
> > Vsite spread         1   500002     41.825     14.3    0.7
> > Write traj.          1        3     13.063      4.5    0.2
> > Update               1   250001     82.011     28.0    1.4
> > Constraints          1   250001    154.350     52.7    2.7
> > Rest                 1             102.249     34.9    1.8
> 7405c6920
> < Total                1            6108.520   2087.5  100.0
> ---
> > Total                1            5761.363   1968.8  100.0
>
> Here 102 G-cycles becomes 417 (34.9 s to 142.6 s) despite factoring
> out 60 G-cycles (20.6 s) for REMD. So the time spent doing the exchange
> is just noticeable, but quite a bit less than the observed increase in
> total time.
>
> For the lowest replica in parallel:
>
> 8481,8496c7971,7985
> < Domain decomp.       8    25010    152.338     52.1    1.8
> < DD comm. load        8    24226      1.085      0.4    0.0
> < DD comm. bounds      8    24219      4.167      1.4    0.0
> < Vsite constr.        8   250001     62.857     21.5    0.8
> < Comm. coord.         8   250001    132.068     45.1    1.6
> < Neighbor search      8    25010    367.001    125.4    4.4
> < Force                8   250001   3446.528   1177.8   41.2
> < Wait + Comm. F       8   250001    252.245     86.2    3.0
> < PME mesh             8   250001   2113.009    722.1   25.3
> < Vsite spread         8   500002    102.749     35.1    1.2
> < Write traj.          8        1      1.206      0.4    0.0
> < Update               8   250001     85.793     29.3    1.0
> < Constraints          8   250001    464.294    158.7    5.5
> < Comm. energies       8   250002     73.343     25.1    0.9
> < REMD                 8      100    162.661     55.6    1.9
> < Rest                 8             945.642    323.2   11.3
> ---
> > Domain decomp.       8    25001    146.561     50.1    2.0
> > DD comm. load        8    22943      0.989      0.3    0.0
> > DD comm. bounds      8    22901      3.768      1.3    0.1
> > Vsite constr.        8   250001     64.035     21.9    0.9
> > Comm. coord.         8   250001    124.487     42.5    1.7
> > Neighbor search      8    25001    367.342    125.5    5.0
> > Force                8   250001   3443.161   1176.7   46.9
> > Wait + Comm. F       8   250001    237.697     81.2    3.2
> > PME mesh             8   250001   2119.205    724.2   28.9
> > Vsite spread         8   500002     95.092     32.5    1.3
> > Write traj.          8        1      0.920      0.3    0.0
> > Update               8   250001     85.529     29.2    1.2
> > Constraints          8   250001    391.469    133.8    5.3
> > Comm. energies       8   250002    120.291     41.1    1.6
> > Rest                 8             139.127     47.5    1.9
> 8498c7987
> < Total                8            8366.984   2859.3  100.0
> ---
> > Total                8            7339.674   2508.3  100.0
>
> Again REMD exchanges are only a small fraction of the increase (139 to
> 946 G-cycles, i.e. 47.5 s to 323.2 s, despite 163 G-cycles or 55.6 s
> accounted for).
>
> Does anyone have a theory on what could be causing this?
>
> Mark

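As a quick check on the quoted comparison, the slowdown in each case can be recomputed from the Seconds column of the diffs. This is only a sketch: the values are copied from the quoted tables, while the dictionary layout and the function name `breakdown` are illustrative and not part of GROMACS.

```python
# Seconds-column values copied from the quoted diffs.
# Each pair is (run inside REMD, same .tpr run on its own).
cases = {
    "lowest replica, 1 proc": {
        "total": (2087.5, 1977.5), "rest": (140.1, 41.7), "remd": 20.2},
    "highest replica, 1 proc": {
        "total": (2087.5, 1968.8), "rest": (142.6, 34.9), "remd": 20.6},
    "lowest replica, 8 procs": {
        "total": (2859.3, 2508.3), "rest": (323.2, 47.5), "remd": 55.6},
}


def breakdown(case):
    """Return the total slowdown and how much of it the REMD row explains."""
    total_remd, total_alone = case["total"]
    rest_remd, rest_alone = case["rest"]
    slowdown = total_remd - total_alone
    return {
        "slowdown_s": round(slowdown, 1),
        "slowdown_pct": round(100.0 * slowdown / total_alone, 1),
        "rest_increase_s": round(rest_remd - rest_alone, 1),
        "remd_row_s": case["remd"],
    }


for name, case in cases.items():
    print(name, breakdown(case))
```

In all three cases the growth in the "Rest" row is several times larger than the time reported in the "REMD" row, which is the discrepancy being asked about.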
No theory, but some more data.

I've been running REMD on a fairly large system, with 48 replicas
between 300K and 400K. I have runs using Gromacs 4.5.3 and 2, 4 or 16
processors per replica. As a general statement, it all seems to scale
fine, and no great delays from the RE.

However, I did do some quick timing checks for the 2 procs per replica
case. I simply hacked in a few timing statements, so nothing so polished
as your hack :)

An average MD step takes about 0.3 s. The time spent in the replica
exchange attempt (which I took to be the time in the call to
replica_exchange() from md.c) was around 0.003 s, i.e. about 1% for a RE
cycle. Given that I only attempt an exchange every 1000 cycles, I took
this to be negligible.

The only odd thing I saw was that on a RE cycle it appears to spend 0.6s
in do_force() which is twice the average MD step time. I didn't print
this out for non-RE cycles, so no sanity check I am afraid.

For time lost in REMD, I guess the issue is when the replicas get
synchronised. There seems to be an MPI_Allreduce called as part of
get_replica_exchange() (when it collects the potential energies) which
is within my timings, but I am not sure if there is anything else.

Sorry that these figures are a bit rough and ready, but they do seem to
support your finding that the calls to REMD aren't to blame.

Cheers
Martyn

--
Dr. Martyn Winn
STFC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, U.K.
Tel: +44 1925 603455    E-mail: martyn.w...@stfc.ac.uk
Fax: +44 1925 603634    Skype name: martyn.winn
URL: http://www.ccp4.ac.uk/martyn/
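
For reference, the arithmetic behind the "negligible" estimate in the reply can be written out as below. This is a rough sketch using only the figures quoted in the message; the variable and function names are illustrative, and the do_force() excess rests on the stated assumption that an ordinary do_force() call takes no longer than one average MD step.

```python
# Figures quoted in the reply (2 processors per replica).
step_time_s = 0.3              # average MD step
exchange_call_s = 0.003        # time inside replica_exchange() on an RE cycle
do_force_on_exchange_s = 0.6   # do_force() time observed on an RE cycle
exchange_interval = 1000       # cycles between exchange attempts


def amortized_fraction(extra_s, interval, step_s):
    """Extra time per exchange attempt as a fraction of one interval of MD."""
    return extra_s / (interval * step_s)


# Share of a single step taken by the exchange call on an RE cycle.
call_share_of_step = exchange_call_s / step_time_s

# The same cost spread over all cycles between exchange attempts.
call_amortized = amortized_fraction(
    exchange_call_s, exchange_interval, step_time_s)

# Assumption: a normal do_force() call costs at most one average step,
# so the excess on an RE cycle is at least 0.6 s - 0.3 s.
do_force_excess_s = do_force_on_exchange_s - step_time_s
do_force_amortized = amortized_fraction(
    do_force_excess_s, exchange_interval, step_time_s)

print(f"exchange call: {100 * call_share_of_step:.1f}% of an RE cycle")
print(f"exchange call amortized: {100 * call_amortized:.4f}% of run time")
print(f"do_force() excess amortized: {100 * do_force_amortized:.2f}% of run time")
```

On these figures the exchange call is about 1% of an RE cycle and about 0.001% of the run once amortized, and even the unexplained do_force() excess comes to roughly 0.1%, so neither accounts for a slowdown of several percent.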