Marshall, would you like to give LAM-MPI a try?
Also, there is a patch that improves the PME communication, which has been
featured on the Gromacs homepage:
http://wwwuser.gwdg.de/~ckutzne/
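If you do try it, here is a minimal sketch of running mdrun_mpi under
LAM/MPI (the hostfile and run-input names are placeholders):

    # boot the LAM runtime on the nodes listed in the hostfile
    lamboot -v hostfile
    # run mdrun_mpi across five processes (topol.tpr is a placeholder name)
    mpirun -np 5 mdrun_mpi -s topol.tpr
    # shut the runtime down when the run is finished
    lamhalt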
Regards,
Yang Ye
On 5/28/2007 9:07 AM, Mark Abraham wrote:
Trevor Marshall wrote:
Can anybody give me any ideas that might help me optimize my new
cluster for a more linear speed increase as I add computing cores?
The new Intel Core2 CPUs are inherently very fast, but my mdrun
simulation performance becomes asymptotic to a value only about
twice the speed I can get from a single core.
The throughput rate is a better measure of performance than the Gflops
reported by GROMACS' internal accounting. See
http://www.gromacs.org/gromacs/benchmark/benchmarks.html
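For a quick check, the throughput is on the Performance line near the end
of the log (md.log is mdrun's default log name; adjust if you renamed it):

    # the (hour/ns) column of the Performance line is the throughput figure
    grep -A1 '(hour/ns)' md.log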
With mdrun_mpi I am simulating a 240-residue protein plus ligand for
10,000 time steps. Here are the results for various combinations of
one to five cores.
One local core only running mdrun:         18.3 hr/nsec   2.61 Gflops
Two local cores:                           9.98 hr/nsec   4.83 Gflops
Three local cores:                         7.35 hr/nsec   6.65 Gflops
Four local cores (one also controlling):   7.72 hr/nsec   6.42 Gflops
Three local cores and two remote cores:    7.59 hr/nsec   6.72 Gflops
One local and two remote cores:            9.76 hr/nsec   5.02 Gflops
Here, the best you can expect three local cores to return is
18.3 / 3 = 6.1 h/ns, *if* there are no limitations from memory or I/O -
and that 18.3 h/ns number is probably with the rest of the machine
unloaded, so it is an optimistic baseline rather than a realistic one.
Given Erik's suggestion, how is 7.35 h/ns so bad?
I get good performance with one local core doing control and three
doing calculations, giving 6.66 Gflops. However, adding two extra
remote cores increases the speed only a very small amount, to 6.72
Gflops, even though the log (below) shows good task distribution (I
think).
Not really... you're spending nearly half your simulation time (45.6%
in Coul(T) + LJ [W3-W3], the nonbonded loops optimized for
interactions between 3-point waters) getting 86% scaling because CPU0
is doing only about half the work of the others. That's because it has
the whole protein/ligand on it.
To fix this, particularly for heterogeneous cluster setups, I think
you should be using the -sort and -shuffle options to grompp - see the
man page.
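A minimal sketch of that pre-processing step (the file names are grompp's
defaults and the five-process count matches your run; check grompp -h for
the exact semantics in your version):

    # -shuffle spreads the solvent molecules evenly over the 5 nodes,
    # -sort orders them for better neighbour-searching locality
    grompp -f grompp.mdp -c conf.gro -p topol.top -np 5 -shuffle -sort -o topol.tpr
    mpirun -np 5 mdrun_mpi -s topol.tpr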
Is there some problem with scaling when using these new fast CPUs?
Can I tweak anything in mdrun_mpi to give better scaling?
In short, no :-) More comments below.
M E G A - F L O P S A C C O U N T I N G
Parallel run - timing based on wallclock.
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
NF=No Forces
Computing:                        M-Number         M-Flops  % of Flops
-----------------------------------------------------------------------
LJ                              928.067418    30626.224794         1.1
Coul(T)                         886.762558    37244.027436         1.4
Coul(T) [W3]                     92.882138    11610.267250         0.4
Coul(T) + LJ                    599.004388    32945.241340         1.2
Coul(T) + LJ [W3]               243.730360    33634.789680         1.2
Coul(T) + LJ [W3-W3]           3292.173000  1257610.086000        45.6
Outer nonbonded loop            945.783063     9457.830630         0.3
1,4 nonbonded interactions       41.184118     3706.570620         0.1
Spread Q Bspline              51931.592640   103863.185280         3.8
Gather F Bspline              51931.592640   623179.111680        22.6
3D-FFT                        40498.449440   323987.595520        11.7
Solve PME                      3000.300000   192019.200000         7.0
NS-Pairs                       1044.424912    21932.923152         0.8
Reset In Box                     24.064040      216.576360         0.0
Shift-X                         961.696160     5770.176960         0.2
CG-CoM                            8.242234      239.024786         0.0
Sum Forces                      721.272120      721.272120         0.0
Bonds                            25.022502     1075.967586         0.0
Angles                           36.343634     5924.012342         0.2
Propers                          13.411341     3071.197089         0.1
Impropers                        12.171217     2531.613136         0.1
Virial                          241.774175     4351.935150         0.2
Ext.ens. Update                 240.424040    12982.898160         0.5
Stop-CM                         240.400000     2404.000000         0.1
Calc-Ekin                       240.448080     6492.098160         0.2
Constraint-V                    240.424040     1442.544240         0.1
Constraint-Vir                  215.884746     5181.233904         0.2
Settle                           71.961582    23243.590986         0.8
-----------------------------------------------------------------------
Total                                       2757465.194361       100.0
-----------------------------------------------------------------------
               NODE (s)   Real (s)      (%)
Time:           408.000    408.000    100.0
                           6:48
A six-minute simulation is pushing the low end for a benchmark. Nobody
simulates for only 10 ps... I would go at least a factor of ten longer
for benchmarking.
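For example, a sketch of the .mdp change (the 1 fs time step is inferred
from the 10 ps figure - adjust to your actual settings):

    ; run ten times longer for benchmarking
    dt      = 0.001    ; ps per step, i.e. 1 fs (assumed)
    nsteps  = 100000   ; 100,000 steps = 100 ps instead of 10 ps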
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     14.810      6.758      3.176      7.556
Detailed load balancing info in percentage of average
Type                        NODE:   0    1    2    3    4  Scaling
-------------------------------------------------------------------
LJ:                               423    0    3   41   32      23%
Coul(T):                          500    0    0    0    0      20%
Coul(T) [W3]:                       0    0   32  291  176      34%
Coul(T) + LJ:                     500    0    0    0    0      20%
Coul(T) + LJ [W3]:                  0    0   24  296  178      33%
Coul(T) + LJ [W3-W3]:              60  116  108  106  107      86%
Outer nonbonded loop:             246   42   45   79   85      40%
1,4 nonbonded interactions:       500    0    0    0    0      20%
Spread Q Bspline:                  98  100  102  100   97      97%
Gather F Bspline:                  98  100  102  100   97      97%
3D-FFT:                           100  100  100  100  100     100%
Solve PME:                        100  100  100  100  100     100%
NS-Pairs:                         107   96   91  103  100      93%
Reset In Box:                      99  100  100  100   99      99%
Shift-X:                           99  100  100  100   99      99%
CG-CoM:                           110   97   97   97   97      90%
Sum Forces:                       100  100  100   99   99      99%
Bonds:                            499    0    0    0    0      20%
Angles:                           500    0    0    0    0      20%
Propers:                          499    0    0    0    0      20%
Impropers:                        500    0    0    0    0      20%
Virial:                            99  100  100  100   99      99%
Ext.ens. Update:                   99  100  100  100   99      99%
Stop-CM:                           99  100  100  100   99      99%
Calc-Ekin:                         99  100  100  100   99      99%
Constraint-V:                      99  100  100  100   99      99%
Constraint-Vir:                    54  111  111  111  111      89%
Settle:                            54  111  111  111  111      89%
Total Force:                       93  102   97  104  102      95%
Total Shake:                       56  110  110  110  110      90%
Total Scaling: 95% of max performance
Finished mdrun on node 0 Sun May 27 07:29:57 2007
Erik,
I also have older systems that use Opteron 165 CPUs. I have run
tests of the AMD Opteron 165 CPUs (2.18 GHz) against the Intel Core2
Duos (3 GHz). Twelve concurrent AutoDock jobs on each machine show the
Core2 Duos outperforming the Opterons by a factor of two.
One worthwhile test is running four copies of the same single-CPU job
on the same new node. The memory and disk accesses will then
de-synchronise, and you can see whether either of these is going to
be rate-limiting for a four-CPU job. Those numbers are a much better
comparison for scaling than a one-CPU job with the rest of the box
unloaded (presumably).
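Something like this sketch, assuming four prepared copies of the run
input in directories copy1..copy4:

    # four independent single-core mdrun jobs competing for memory and disk
    for i in 1 2 3 4; do
      ( cd copy$i && mdrun -s topol.tpr ) > run$i.out 2>&1 &
    done
    wait   # then compare each job's hr/ns against the lone-job number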
The data I posted showed inconsistencies that have nothing to do
with memory bandwidth, and I was rather hoping for an analysis based
on the manner in which GROMACS mdrun distributes its computing tasks.
They're also confounded with the interconnect performance in some cases.
I don't believe my data shows memory-bandwidth-limiting effects. For
example, three 'local' CPUs on the quad core are faster (6.65 Gflops)
than one of the quad's CPUs plus two from the cluster (5.02 Gflops).
How does that support the memory bandwidth hypothesis?
So here you've got 3 faster CPUs outperforming 1 faster CPU and 2
slower CPUs across a Gigabit network? That's not a huge surprise.
You'd need a strong memory bandwidth effect for the former to be hurt
enough to overcome the two limitations in the latter.
I figured that the GAMMA MP software might be causing overhead, but
when I examined the distribution of tasks by GROMACS (in the log I
provided), it seemed that the tasks mdrun distributed to GAMMA were
actually distributed well, and that the manner in which CPU0 hogged
most of the mdrun calculations might be the bottleneck. It was insight
into GROMACS' mdrun distribution methodology that I was seeking. Is
there any quantitative data available for me to review?
CPU0 is not hogging - it's underloaded, if anything.
Mark
_______________________________________________
gmx-users mailing list [email protected]
http://www.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/search before
posting!
Please don't post (un)subscribe requests to the list. Use the www
interface or send it to [EMAIL PROTECTED]
Can't post? Read http://www.gromacs.org/mailing_lists/users.php