Re: [gmx-users] GROMACS not scaling well with Core4 Quad technology CPUs

Erik Lindahl Sun, 27 May 2007 12:46:12 -0700

Hi Trevor,

It's probably due to memory bandwidth limitations, as well as Intel's design.

Intel managed to get quad cores to market by gluing together two dual-core chips. All communication between them has to go over the front-side bus, though, and all eight cores in a system share the bandwidth to memory.

This can become a problem when you're running in parallel, since all eight processes are communicating (= using the bus bandwidth) at once, and have to share it. You will probably get much better performance by running multiple (8) independent simulations.

Essentially, there's no such thing as a free lunch. Intel's quad-core chips are cheap, but have the same drawback as their first-generation dual-core chips. AMD's solution with real quad-cores and on-chip memory controllers in Barcelona is looking a whole lot better, but I also expect it to be quite a bit more expensive.

You might want to test the CVS version for better scaling. The lower amount of data communicated there might improve performance a bit for you.


Cheers,

Erik


On May 27, 2007, at 6:28 PM, Trevor Marshall wrote:

Can anybody give me any ideas which might help me optimize my new cluster for a more linear speed increase as I add computing cores? The new Intel Core2 CPUs are inherently very fast, and my mdrun simulation performance is becoming asymptotic to a value only about twice the speed I can get from a single core.
I have included the log output from mdrun_mpi when using 5 cores at the foot of this email. But here is the system overview.
My cluster system comprises two computers running Fedora Core 6 and MPI-GAMMA. Both have Intel Core2 CPUs running at 3GHz core speed (overclocked). The main machine now has a sparkling new Core2 Quad 4-processor CPU and the remote still has a Core2 Duo dual-core CPU.
Networking hardware is crossover CAT6 cables. The GAMMA software is connected thru one Intel PRO/1000 board in each computer, with MTU 9000. A Gigabit adapter with Realtek chipset is the primary Linux network in each machine, with MTU 1500. For the common filesystem I am running NFS on a mounted filesystem with "async" declared in the exports file. The mount is /dev/hde1 to /media and then /media is exported via NFS to the cluster machine. File I/O does not seem to be a bottleneck.
With mdrun_mpi I am calculating a 240aa protein and ligand for 10,000 time intervals. Here are the results for various combinations of one, two, three, four and five cores.
One local core only running mdrun:        18.3 hr/nsec    2.61 Gflops
Two local cores:                          9.98 hr/nsec    4.83 Gflops
Three local cores:                        7.35 hr/nsec    6.65 Gflops
Four local cores (one also controlling):  7.72 hr/nsec    6.42 Gflops
Three local cores and two remote cores:   7.59 hr/nsec    6.72 Gflops
One local and 2 remote cores:             9.76 hr/nsec    5.02 Gflops
I get good performance with one local core doing control, and three doing calculations, giving 6.66 Gflops. However, adding two extra remote cores only increases the speed a very small amount to 6.72 Gflops, even though the log (below) shows good task distribution (I think).
Is there some problem with scaling when using these new fast CPUs? Can I tweak anything in mdrun_mpi to give better scaling?
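
To put numbers on the scaling described above, here is a minimal Python sketch that turns the hr/nsec timings from the table into speedup and parallel efficiency relative to the single-core run. The labels, the function name and the decision to count the "one also controlling" run as four cores are assumptions made for this illustration; only the timings come from the message.

```python
# Timings copied from the table above: (number of cores, hours per simulated ns).
# The labels are chosen for this sketch; they are not GROMACS output.
TIMINGS = {
    "1 local": (1, 18.3),
    "2 local": (2, 9.98),
    "3 local": (3, 7.35),
    "4 local": (4, 7.72),   # assumption: the controlling core is counted too
    "3 local + 2 remote": (5, 7.59),
    "1 local + 2 remote": (3, 9.76),
}

BASELINE_HOURS_PER_NS = 18.3  # single-core run


def speedup_and_efficiency(cores, hours_per_ns, baseline=BASELINE_HOURS_PER_NS):
    """Return (speedup over one core, speedup divided by core count)."""
    speedup = baseline / hours_per_ns
    return speedup, speedup / cores


if __name__ == "__main__":
    for label, (cores, hours) in TIMINGS.items():
        s, e = speedup_and_efficiency(cores, hours)
        print(f"{label:<20} speedup {s:5.2f}   efficiency {e:6.1%}")
```

With these figures, three local cores reach a speedup of about 2.5 (roughly 83% efficiency), while four and five cores stay near 2.4, so efficiency falls to roughly 59% and 48%.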
Sincerely
Trevor
------------------------------------------
Trevor G Marshall, PhD
School of Biological Sciences and Biotechnology, Murdoch University, Western Australia
Director, Autoimmunity Research Foundation, Thousand Oaks, California
Patron, Australian Autoimmunity Foundation.
------------------------------------------

        M E G A - F L O P S   A C C O U N T I N G

        Parallel run - timing based on wallclock.
   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
   T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
   NF=No Forces
 Computing:                        M-Number         M-Flops  % of Flops
-----------------------------------------------------------------------
 LJ                              928.067418    30626.224794     1.1
 Coul(T)                         886.762558    37244.027436     1.4
 Coul(T) [W3]                     92.882138    11610.267250     0.4
 Coul(T) + LJ                    599.004388    32945.241340     1.2
 Coul(T) + LJ [W3]               243.730360    33634.789680     1.2
 Coul(T) + LJ [W3-W3]           3292.173000  1257610.086000    45.6
 Outer nonbonded loop            945.783063     9457.830630     0.3
 1,4 nonbonded interactions       41.184118     3706.570620     0.1
 Spread Q Bspline              51931.592640   103863.185280     3.8
 Gather F Bspline              51931.592640   623179.111680    22.6
 3D-FFT                        40498.449440   323987.595520    11.7
 Solve PME                      3000.300000   192019.200000     7.0
 NS-Pairs                       1044.424912    21932.923152     0.8
 Reset In Box                     24.064040      216.576360     0.0
 Shift-X                         961.696160     5770.176960     0.2
 CG-CoM                            8.242234      239.024786     0.0
 Sum Forces                      721.272120      721.272120     0.0
 Bonds                            25.022502     1075.967586     0.0
 Angles                           36.343634     5924.012342     0.2
 Propers                          13.411341     3071.197089     0.1
 Impropers                        12.171217     2531.613136     0.1
 Virial                          241.774175     4351.935150     0.2
 Ext.ens. Update                 240.424040    12982.898160     0.5
 Stop-CM                         240.400000     2404.000000     0.1
 Calc-Ekin                       240.448080     6492.098160     0.2
 Constraint-V                    240.424040     1442.544240     0.1
 Constraint-Vir                  215.884746     5181.233904     0.2
 Settle                           71.961582    23243.590986     0.8
-----------------------------------------------------------------------
 Total                                       2757465.194361   100.0
-----------------------------------------------------------------------
               NODE (s)   Real (s)      (%)
       Time:    408.000    408.000    100.0
                       6:48
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     14.810      6.758      3.176      7.556

Detailed load balancing info in percentage of average
Type        NODE:  0   1   2   3   4 Scaling
-------------------------------------------
             LJ:423   0   3  41  32     23%
        Coul(T):500   0   0   0   0     20%
   Coul(T) [W3]:  0   0  32 291 176     34%
   Coul(T) + LJ:500   0   0   0   0     20%
Coul(T) + LJ [W3]:  0   0  24 296 178     33%
Coul(T) + LJ [W3-W3]: 60 116 108 106 107     86%
Outer nonbonded loop:246  42  45  79  85     40%
1,4 nonbonded interactions:500   0   0   0   0     20%
Spread Q Bspline: 98 100 102 100  97     97%
Gather F Bspline: 98 100 102 100  97     97%
         3D-FFT:100 100 100 100 100    100%
      Solve PME:100 100 100 100 100    100%
       NS-Pairs:107  96  91 103 100     93%
   Reset In Box: 99 100 100 100  99     99%
        Shift-X: 99 100 100 100  99     99%
         CG-CoM:110  97  97  97  97     90%
     Sum Forces:100 100 100  99  99     99%
          Bonds:499   0   0   0   0     20%
         Angles:500   0   0   0   0     20%
        Propers:499   0   0   0   0     20%
      Impropers:500   0   0   0   0     20%
         Virial: 99 100 100 100  99     99%
Ext.ens. Update: 99 100 100 100  99     99%
        Stop-CM: 99 100 100 100  99     99%
      Calc-Ekin: 99 100 100 100  99     99%
   Constraint-V: 99 100 100 100  99     99%
 Constraint-Vir: 54 111 111 111 111     89%
         Settle: 54 111 111 111 111     89%

    Total Force: 93 102  97 104 102     95%


    Total Shake: 56 110 110 110 110     90%


Total Scaling: 95% of max performance

Finished mdrun on node 0 Sun May 27 07:29:57 2007
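
As a quick sanity check on the summary figures in the log above, the sketch below recomputes the GFlops rate from the accounting total and the wall time, and converts ns/day to hour/ns. All constants are copied from the log; the variable names are chosen for this example, and the relations used (MFlops divided by seconds, 24 hours divided by ns/day) are plain unit conversions rather than anything taken from the GROMACS source.

```python
# Values copied from the mdrun log above.
TOTAL_MFLOPS = 2757465.194361    # "Total" row of the M-Flops column
WALL_SECONDS = 408.0             # "Time:" row, Real (s)
REPORTED_GFLOPS = 6.758
REPORTED_NS_PER_DAY = 3.176
REPORTED_HOURS_PER_NS = 7.556

# Total floating-point work divided by wall time, converted from MFlops to GFlops.
gflops = TOTAL_MFLOPS / WALL_SECONDS / 1000.0

# A day has 24 hours, so hour/ns is 24 divided by ns/day.
hours_per_ns = 24.0 / REPORTED_NS_PER_DAY

if __name__ == "__main__":
    print(f"GFlops:  computed {gflops:.3f}, reported {REPORTED_GFLOPS}")
    print(f"hour/ns: computed {hours_per_ns:.3f}, reported {REPORTED_HOURS_PER_NS}")
```

Both recomputed values agree with the reported ones to within the rounding of the printed figures.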

_______________________________________________
gmx-users mailing list    [email protected]
http://www.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to [EMAIL PROTECTED]
Can't post? Read http://www.gromacs.org/mailing_lists/users.php
