On 12/07/13 13:26, Francesco wrote:
Hi all,
I'm working with a 200K-atom system (protein + explicit water) and,
after a while on a CPU cluster, I had to switch to a GPU cluster.
I read both the "Acceleration and parallelization" and the GROMACS-GPU
documentation pages
(http://www.gromacs.org/Documentation/Acceleration_and_parallelization
and
http://www.gromacs.org/Documentation/Installation_Instructions_4.5/GROMACS-OpenMM)
but it's a bit confusing and I'd like to check that I have
understood it correctly. :)
I have 2 types of nodes:
- 3 GPUs (NVIDIA Tesla M2090) and 2 CPUs with 6 cores each (Intel Xeon E5649 @ 2.53GHz)
- 8 GPUs and 2 CPUs (6 cores each)

1) I can only have 1 MPI rank per GPU, meaning that with 3 GPUs I can have
3 MPI ranks max.
2) Because I have 12 cores, I can open 4 OpenMP threads per MPI rank,
since 4 x 3 = 12.

Now, if I have a node with 8 GPUs, I can only use 4 of them:
4 MPI ranks with 3 OpenMP threads each.
Is that right?
Is it possible to use 8 GPUs with only 8 cores?

You could set -ntomp 0 and set up MPI/thread-MPI to use 8 cores. However, a system that unbalanced (a huge amount of GPU power against comparatively little CPU power) is unlikely to get great performance.
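
For what it's worth, a minimal sketch of such an 8-GPU launch with thread-MPI
(the -gpu_id string assumes device ids 0-7, so check the ids mdrun reports on
your node; the input/output names are just copied from your command):

  # 8 thread-MPI ranks, 1 OpenMP thread each, one GPU per rank
  # (uses only 8 of the node's 12 cores)
  mdrun -ntmpi 8 -ntomp 1 -gpu_id 01234567 -dlb yes -s input_50.tpr -deffnm 306s_50 -v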

Using GROMACS 4.6.2 and 144 CPU cores I reach 35 ns/day, while with 3
GPUs and 12 cores I get 9-11 ns/day.

That slowdown is in line with what I got when I tried a similar CPU-GPU setup. That said, others might have advice that will improve your performance.

The command that I use is:
mdrun -dlb yes -s input_50.tpr -deffnm 306s_50 -v
with the number of GPUs set via the job script:
#BSUB -n 3

I also tried setting -npme / -nt / -ntmpi / -ntomp, but nothing changed.
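
For reference, the fully explicit form of that launch on the 3-GPU node would
look something like the line below; this is only a sketch and assumes
thread-MPI, device ids 0-2 and the same input files as your run:

  # 3 thread-MPI ranks, one per GPU, 4 OpenMP threads each = 12 cores in use
  mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -s input_50.tpr -deffnm 306s_50 -v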

The mdp file and some statistics follow:

-------- START MDP --------

title                   = G6PD wt molecular dynamics (2bhl.pdb) - NPT MD

; Run parameters
integrator              = md            ; algorithm options
nsteps                  = 25000000      ; maximum number of steps to perform [50 ns]
dt                      = 0.002         ; 2 fs = 0.002 ps

; Output control
nstxout                 = 10000         ; [steps] freq to write coordinates to trajectory; the last coordinates are always written
nstvout                 = 10000         ; [steps] freq to write velocities to trajectory; the last velocities are always written
nstlog                  = 10000         ; [steps] freq to write energies to the log file; the last energies are always written
nstenergy               = 10000         ; [steps] write energies to disk every nstenergy steps
nstxtcout               = 10000         ; [steps] freq to write coordinates to the xtc trajectory
xtc_precision           = 1000          ; precision to write to the xtc trajectory (1000 = default)
xtc_grps                = system        ; which coordinate group(s) to write to disk
energygrps              = system        ; which energy group(s) to write

; Bond parameters
continuation            = yes           ; restarting from NPT
constraints             = all-bonds     ; bond types to replace by constraints
constraint_algorithm    = lincs         ; holonomic constraints
lincs_iter              = 1             ; accuracy of LINCS
lincs_order             = 4             ; also related to accuracy
lincs_warnangle         = 30            ; [degrees] maximum angle a bond can rotate before LINCS will complain


Those constraint settings seem a little loose, but setting that up and checking that it conserves energy and preserves bond lengths is something you'll have to do yourself.
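
If you do decide to tighten them, the usual knobs in the mdp are along these
lines (illustrative values only, not something I have tested on your system):

  lincs_iter   = 2    ; extra LINCS iteration  -> tighter constraint accuracy
  lincs_order  = 6    ; higher expansion order -> tighter constraint accuracy

You can then watch the drift of the conserved quantity with something like
"g_energy -f 306s_50.edr -o energy.xvg" (the .edr name assumes your -deffnm)
and select the conserved-energy term when prompted.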

Richard
; Neighborsearching
ns_type                 = grid          ; method of updating the neighbor list
cutoff-scheme           = Verlet
nstlist                 = 10            ; [steps] frequency to update the neighbor list (10)
rlist                   = 1.0           ; [nm] cut-off distance for the short-range neighbor list (1 = default)
rcoulomb                = 1.0           ; [nm] long-range electrostatic cut-off
rvdw                    = 1.0           ; [nm] long-range Van der Waals cut-off

; Electrostatics
coulombtype             = PME           ; treatment of long-range electrostatic interactions
vdwtype                 = cut-off       ; treatment of Van der Waals interactions

; Periodic boundary conditions
pbc                     = xyz

; Dispersion correction
DispCorr                = EnerPres      ; apply long-range dispersion corrections for energy and pressure

; Ewald
fourierspacing          = 0.12          ; [nm] grid spacing for FFT - controls the highest magnitude of wave vectors (0.12)
pme_order               = 4             ; interpolation order for PME, 4 = cubic
ewald_rtol              = 1e-5          ; relative strength of the Ewald-shifted potential at rcoulomb

; Temperature coupling
tcoupl                  = nose-hoover   ; Nose-Hoover temperature coupling
tc_grps                 = Protein Non-Protein
tau_t                   = 0.4   0.4     ; [ps] time constant
ref_t                   = 310   310     ; [K] reference temperature for coupling (310 K ≈ 37 °C)

; Pressure coupling
pcoupl                  = parrinello-rahman
pcoupltype              = isotropic     ; uniform scaling of box vectors
tau_p                   = 2.0           ; [ps] time constant
ref_p                   = 1.0           ; [bar] reference pressure for coupling
compressibility         = 4.5e-5        ; [bar^-1] isothermal compressibility of water
refcoord_scaling        = com           ; see the GROMACS manual, chapter 7

; Velocity generation
gen_vel                 = no            ; whether to generate velocities in grompp according to a Maxwell distribution

-------- END MDP --------

-------- START STATISTICS  --------

        P P   -   P M E   L O A D   B A L A N C I N G

  PP/PME load balancing changed the cut-off and PME settings:
            particle-particle                    PME
             rcoulomb  rlist            grid      spacing   1/beta
    initial  1.000 nm  1.155 nm     100 128  96   0.120 nm  0.320 nm
    final    1.201 nm  1.356 nm      96 100  80   0.144 nm  0.385 nm
  cost-ratio           1.62             0.62
  (note that these numbers concern only part of the total PP and PME load)

    M E G A - F L O P S   A C C O U N T I N G

  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
  V&F=Potential and force  V=Potential only  F=Force only

     D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

  av. #atoms communicated per step for force:  2 x 54749.0
  av. #atoms communicated per step for LINCS:  2 x 5418.4

  Average load imbalance: 12.8 %
  Part of the total run time spent waiting due to load imbalance: 1.4 %
  Steps where the load balancing was limited by -rdd, -rcon and/or -dds:  Y 0 %


      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

  Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles       %
-----------------------------------------------------------------------------
  Domain decomp.         3    4     625000   10388.307   315806.805     2.3
  DD comm. load          3    4     625000     129.908     3949.232     0.0
  DD comm. bounds        3    4     625001     267.204     8123.069     0.1
  Neighbor search        3    4     625001    7756.651   235803.900     1.7
  Launch GPU ops.        3    4   50000002    3376.764   102654.354     0.8
  Comm. coord.           3    4   24375000   10651.213   323799.209     2.4
  Force                  3    4   25000001   35732.146  1086265.102     8.0
  Wait + Comm. F         3    4   25000001   19866.403   603943.033     4.5
  PME mesh               3    4   25000001  235964.754  7173380.387    53.0
  Wait GPU nonlocal      3    4   25000001   12055.970   366504.140     2.7
  Wait GPU local         3    4   25000001     106.179     3227.866     0.0
  NB X/F buffer ops.     3    4   98750002   10256.750   311807.459     2.3
  Write traj.            3    4       2994     249.770     7593.073     0.1
  Update                 3    4   25000001   33108.852  1006516.379     7.4
  Constraints            3    4   25000001   51671.482  1570824.423    11.6
  Comm. energies         3    4    2500001     463.135    14079.404     0.1
  Rest                   3                   13290.037   404020.040     3.0
-----------------------------------------------------------------------------
  Total                  3                  445335.526 13538297.876   100.0
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
  PME redist. X/F        3    4   50000002   40747.165  1238722.760     9.1
  PME spread/gather      3    4   50000002  122026.128  3709621.109    27.4
  PME 3D-FFT             3    4   50000002   46613.023  1417046.140    10.5
  PME 3D-FFT Comm.       3    4   50000002   20934.134   636402.285     4.7
  PME solve              3    4   25000001    5465.690   166158.163     1.2
-----------------------------------------------------------------------------

                Core t (s)   Wall t (s)        (%)
        Time:  5317976.200   445335.526     1194.2
                          5d03h42:15
                  (ns/day)    (hour/ns)
Performance:        9.701        2.474

-------- END STATISTICS  --------

Thank you very much for the help.
cheers,
Fra
