Marshall, would you like to give LAM-MPI a try?
Also, there is a patch that improves the PME communication, which has been
featured on the Gromacs homepage:
http://wwwuser.gwdg.de/~ckutzne/
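If you do try it, here is a minimal sketch of running mdrun_mpi under
LAM/MPI (the hostfile and run-input names are placeholders):

    # boot the LAM runtime on the nodes listed in the hostfile
    lamboot -v hostfile
    # run mdrun_mpi across five processes (topol.tpr is a placeholder name)
    mpirun -np 5 mdrun_mpi -s topol.tpr
    # shut the runtime down when the run is finished
    lamhalt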
Regards,
Yang Ye
On 5/28/2007 9:07 AM, Mark Abraham wrote:
Trevor Marshall wrote:
Can anybody give me any ideas that might help me optimize my new
cluster for a more linear speed increase as I add computing cores?
The new Intel Core2 CPUs are inherently very fast, but my mdrun
simulation performance becomes asymptotic to a value only about
twice the speed I can get from a single core.
The throughput rate is a better measure of performance than the Gflops
reported by GROMACS' internal accounting. See
http://www.gromacs.org/gromacs/benchmark/benchmarks.html
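For a quick check, the throughput is on the Performance line near the end
of the log (md.log is mdrun's default log name; adjust if you renamed it):

    # the (hour/ns) column of the Performance line is the throughput figure
    grep -A1 '(hour/ns)' md.log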
With mdrun_mpi I am simulating a 240-residue protein plus ligand for
10,000 time steps. Here are the results for various combinations of
one to five cores.
One local core only running mdrun:         18.3 hr/nsec   2.61 Gflops
Two local cores:                           9.98 hr/nsec   4.83 Gflops
Three local cores:                         7.35 hr/nsec   6.65 Gflops
Four local cores (one also controlling):   7.72 hr/nsec   6.42 Gflops
Three local cores and two remote cores:    7.59 hr/nsec   6.72 Gflops
One local and two remote cores:            9.76 hr/nsec   5.02 Gflops
Here, the best you can expect three local cores to return is
18.3 / 3 = 6.1 h/ns, *if* there are no limitations from memory or I/O -
and that 18.3 h/ns number is probably with the rest of the machine
unloaded, so it is an optimistic baseline rather than a realistic one.
Given Erik's suggestion, how is 7.35 h/ns so bad?
I get good performance with one local core doing control and three
doing calculations, giving 6.66 Gflops. However, adding two extra
remote cores increases the speed only a very small amount, to 6.72
Gflops, even though the log (below) shows good task distribution (I
think).
Not really... you're spending nearly half your simulation time (45.6%
in Coul(T) + LJ [W3-W3], the nonbonded loops optimized for
interactions between 3-point waters) getting 86% scaling because CPU0
is doing only about half the work of the others. That's because it has
the whole protein/ligand on it.
To fix this, particularly for heterogeneous cluster setups, I think
you should be using the -sort and -shuffle options to grompp - see the
man page.
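A minimal sketch of that pre-processing step (the file names are grompp's
defaults and the five-process count matches your run; check grompp -h for
the exact semantics in your version):

    # -shuffle spreads the solvent molecules evenly over the 5 nodes,
    # -sort orders them for better neighbour-searching locality
    grompp -f grompp.mdp -c conf.gro -p topol.top -np 5 -shuffle -sort -o topol.tpr
    mpirun -np 5 mdrun_mpi -s topol.tpr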
Is there some problem with scaling when using these new fast CPUs?
Can I tweak anything in mdrun_mpi to give better scaling?
In short, no :-) More comments below.
M E G A - F L O P S A C C O U N T I N G
Parallel run - timing based on wallclock.
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
NF=No Forces
Computing:                        M-Number         M-Flops  % of Flops
-----------------------------------------------------------------------
LJ                              928.067418    30626.224794         1.1
Coul(T)                         886.762558    37244.027436         1.4
Coul(T) [W3]                     92.882138    11610.267250         0.4
Coul(T) + LJ                    599.004388    32945.241340         1.2
Coul(T) + LJ [W3]               243.730360    33634.789680         1.2
Coul(T) + LJ [W3-W3]           3292.173000  1257610.086000        45.6
Outer nonbonded loop            945.783063     9457.830630         0.3
1,4 nonbonded interactions       41.184118     3706.570620         0.1
Spread Q Bspline              51931.592640   103863.185280         3.8
Gather F Bspline              51931.592640   623179.111680        22.6
3D-FFT                        40498.449440   323987.595520        11.7
Solve PME                      3000.300000   192019.200000         7.0
NS-Pairs                       1044.424912    21932.923152         0.8
Reset In Box                     24.064040      216.576360         0.0
Shift-X                         961.696160     5770.176960         0.2
CG-CoM                            8.242234      239.024786         0.0
Sum Forces                      721.272120      721.272120         0.0
Bonds                            25.022502     1075.967586         0.0
Angles                           36.343634     5924.012342         0.2
Propers                          13.411341     3071.197089         0.1
Impropers                        12.171217     2531.613136         0.1
Virial                          241.774175     4351.935150         0.2
Ext.ens. Update                 240.424040    12982.898160         0.5
Stop-CM                         240.400000     2404.000000         0.1
Calc-Ekin                       240.448080     6492.098160         0.2
Constraint-V                    240.424040     1442.544240         0.1
Constraint-Vir                  215.884746     5181.233904         0.2
Settle                           71.961582    23243.590986         0.8
-----------------------------------------------------------------------
Total                                       2757465.194361       100.0
-----------------------------------------------------------------------
               NODE (s)   Real (s)      (%)
Time:           408.000    408.000    100.0
                           6:48
A six-minute simulation is pushing the low end for a benchmark. Nobody
simulates for only 10 ps... I would go at least a factor of ten longer
for benchmarking.
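For example, a sketch of the .mdp change (the 1 fs time step is inferred
from the 10 ps figure - adjust to your actual settings):

    ; run ten times longer for benchmarking
    dt      = 0.001    ; ps per step, i.e. 1 fs (assumed)
    nsteps  = 100000   ; 100,000 steps = 100 ps instead of 10 ps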
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     14.810      6.758      3.176      7.556
Detailed load balancing info in percentage of average
Type                        NODE:   0    1    2    3    4  Scaling
-------------------------------------------------------------------
LJ:                               423    0    3   41   32      23%
Coul(T):                          500    0    0    0    0      20%
Coul(T) [W3]:                       0    0   32  291  176      34%
Coul(T) + LJ:                     500    0    0    0    0      20%
Coul(T) + LJ [W3]:                  0    0   24  296  178      33%
Coul(T) + LJ [W3-W3]:              60  116  108  106  107      86%
Outer nonbonded loop:             246   42   45   79   85      40%
1,4 nonbonded interactions:       500    0    0    0    0      20%
Spread Q Bspline:                  98  100  102  100   97      97%
Gather F Bspline:                  98  100  102  100   97      97%
3D-FFT:                           100  100  100  100  100     100%
Solve PME:                        100  100  100  100  100     100%
NS-Pairs:                         107   96   91  103  100      93%
Reset In Box:                      99  100  100  100   99      99%
Shift-X:                           99  100  100  100   99      99%
CG-CoM:                           110   97   97   97   97      90%
Sum Forces:                       100  100  100   99   99      99%
Bonds:                            499    0    0    0    0      20%
Angles:                           500    0    0    0    0      20%
Propers:                          499    0    0    0    0      20%
Impropers:                        500    0    0    0    0      20%
Virial:                            99  100  100  100   99      99%
Ext.ens. Update:                   99  100  100  100   99      99%
Stop-CM:                           99  100  100  100   99      99%
Calc-Ekin:                         99  100  100  100   99      99%
Constraint-V:                      99  100  100  100   99      99%
Constraint-Vir:                    54  111  111  111  111      89%
Settle:                            54  111  111  111  111      89%
Total Force:                       93  102   97  104  102      95%
Total Shake:                       56  110  110  110  110      90%
Total Scaling: 95% of max performance
Finished mdrun on node 0 Sun May 27 07:29:57 2007
Erik,
I also have older systems that use Opteron 165 CPUs. I have run
tests of the AMD Opteron 165 CPUs (2.18 GHz) against the Intel Core2
Duos (3 GHz). Twelve concurrent AutoDock jobs on each machine show the
Core2 Duos outperforming the Opterons by a factor of two.
One worthwhile test is running four copies of the same single-CPU job
on the same new node. The memory and disk accesses will then
de-synchronise, and you can see whether either of these is going to
be rate-limiting for a four-CPU job. Those numbers are a much better
comparison for scaling than a one-CPU job with the rest of the box
unloaded (presumably).
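Something like this sketch, assuming four prepared copies of the run
input in directories copy1..copy4:

    # four independent single-core mdrun jobs competing for memory and disk
    for i in 1 2 3 4; do
      ( cd copy$i && mdrun -s topol.tpr ) > run$i.out 2>&1 &
    done
    wait   # then compare each job's hr/ns against the lone-job number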
The data I posted showed inconsistencies that have nothing to do
with memory bandwidth, and I was rather hoping for an analysis based
on the manner in which GROMACS mdrun distributes its computing tasks.
They're also confounded with the interconnect performance in some cases.
I don't believe my data shows memory-bandwidth-limiting effects. For
example, three 'local' CPUs on the quad core are faster (6.65 Gflops)
than one of the quad's CPUs plus two from the cluster (5.02 Gflops).
How does that support the memory bandwidth hypothesis?
So here you've got 3 faster CPUs outperforming 1 faster CPU and 2
slower CPUs across a Gigabit network? That's not a huge surprise.
You'd need a strong memory bandwidth effect for the former to be hurt
enough to overcome the two limitations in the latter.
I figured that the GAMMA MP software might be causing overhead, but
when I examined the distribution of tasks by GROMACS (in the log I
provided), it seemed that the tasks mdrun distributed to GAMMA were
actually distributed well, and that the manner in which CPU0 hogged
most of the mdrun calculations might be the bottleneck. It was insight
into GROMACS' mdrun distribution methodology that I was seeking. Is
there any quantitative data available for me to review?
CPU0 is not hogging - it's underloaded, if anything.
Mark
_______________________________________________
gmx-users mailing list [email protected]
http://www.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/search before
posting!
Please don't post (un)subscribe requests to the list. Use the www
interface or send it to [EMAIL PROTECTED]
Can't post? Read http://www.gromacs.org/mailing_lists/users.php