Re: [gmx-users] gpu cluster explanation

2013-07-23 Thread Francesco
Hi Richard,
Thank you for the help and sorry for the delay in my reply.
I tried some test runs, changing some parameters (e.g. removing PME), and
I was able to reach 20 ns/day, so I think that 9-11 ns/day is the maximum
I can obtain with my setup.

Thank you again for your help.

cheers,

Fra

Re: [gmx-users] gpu cluster explanation

2013-07-12 Thread Richard Broadbent



On 12/07/13 13:26, Francesco wrote:

Hi all,
I'm working with a 200K-atom system (protein + explicit water) and,
after a while using a CPU cluster, I had to switch to a GPU cluster.
I read both the "Acceleration and parallelization" and the GROMACS-GPU
documentation pages
(http://www.gromacs.org/Documentation/Acceleration_and_parallelization
and
http://www.gromacs.org/Documentation/Installation_Instructions_4.5/GROMACS-OpenMM)
but it's a bit confusing and I need help to check whether I have really
understood it correctly. :)
I have 2 types of nodes:
3 GPUs (NVIDIA Tesla M2090) and 2 CPUs with 6 cores each (Intel Xeon
E5649 @ 2.53GHz)
8 GPUs and 2 CPUs (6 cores each)

1) I can only have 1 MPI rank per GPU, meaning that with 3 GPUs I can
have 3 MPI ranks at most.
2) Because I have 12 cores, I can open 4 OpenMP threads per MPI rank,
because 4 x 3 = 12.

Now, if I have a node with 8 GPUs, can I use 4 GPUs, i.e. 4 MPI ranks
and 3 OpenMP threads each? Is that right?
Is it possible to use only 8 GPUs and 8 cores?


You could set -ntomp 0 and set up MPI/thread-MPI to use 8 cores.
However, a system that unbalanced (a huge amount of GPU power against
comparatively little CPU power) is unlikely to get great performance.
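
Just as a sketch (untested on your cluster, and assuming the default
single-node thread-MPI build of mdrun 4.6 rather than a real-MPI one),
the explicit rank/thread/GPU mapping would look something like:

   # 3-GPU node: 3 thread-MPI ranks x 4 OpenMP threads = 12 cores, one GPU per rank
   mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -s input_50.tpr -deffnm 306s_50

   # 8-GPU node, using only 8 of the 12 cores: 8 ranks x 1 thread, one GPU per rank
   mdrun -ntmpi 8 -ntomp 1 -gpu_id 01234567 -dlb yes -s input_50.tpr -deffnm 306s_50

The digits in -gpu_id assign one GPU id to each rank on the node, in
order, so every rank gets its own card.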


Using GROMACS 4.6.2 and 144 CPU cores I reach 35 ns/day, while with 3
GPUs and 12 cores I get 9-11 ns/day.

That slowdown is in line with what I got when I tried a similar CPU-GPU 
setup. That said, others might have some advice that will improve your 
performance.



The command that I use is:
mdrun -dlb yes -s input_50.tpr -deffnm 306s_50 -v
with the number of GPUs set via the submission script:
#BSUB -n 3

I also tried setting -npme / -nt / -ntmpi / -ntomp, but nothing changes.

The mdp file and some statistics follow:

 START MDP 

title = G6PD wt molecular dynamics (2bhl.pdb) - NPT MD

; Run parameters
integrator  = md        ; algorithm options
nsteps      = 2500      ; maximum number of steps to perform [50 ns]
dt          = 0.002     ; 2 fs = 0.002 ps

; Output control
nstxout       = 1       ; [steps] freq to write coordinates to trajectory, the last coordinates are always written
nstvout       = 1       ; [steps] freq to write velocities to trajectory, the last velocities are always written
nstlog        = 1       ; [steps] freq to write energies to log file, the last energies are always written
nstenergy     = 1       ; [steps] write energies to disk every nstenergy steps
nstxtcout     = 1       ; [steps] freq to write coordinates to xtc trajectory
xtc_precision = 1000    ; precision to write to xtc trajectory (1000 = default)
xtc_grps      = system  ; which coordinate group(s) to write to disk
energygrps    = system  ; which energy group(s) to write

; Bond parameters
continuation         = yes       ; restarting from NPT
constraints          = all-bonds ; bond types to replace by constraints
constraint_algorithm = lincs     ; holonomic constraints
lincs_iter           = 1         ; accuracy of LINCS
lincs_order          = 4         ; also related to accuracy
lincs_warnangle      = 30        ; [degrees] maximum angle that a bond can rotate before LINCS will complain



That seems a little loose for constraints, but setting that up and 
checking that it conserves energy and preserves bond lengths is something 
you'll have to do yourself.
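
(Purely as an illustration of what tightening could mean here: LINCS
accuracy is controlled by lincs_iter and lincs_order, so raising them is
the usual knob to turn, e.g.

   lincs_iter  = 2    ; more iterative corrections = tighter constraints
   lincs_order = 6    ; higher expansion order = higher accuracy

at some extra cost per step. Whether that is actually needed is exactly
the energy-conservation check mentioned above.)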


Richard


[gmx-users] gpu cluster explanation

2013-07-12 Thread Francesco
Hi all,
I'm working with a 200K-atom system (protein + explicit water) and,
after a while using a CPU cluster, I had to switch to a GPU cluster.
I read both the "Acceleration and parallelization" and the GROMACS-GPU
documentation pages
(http://www.gromacs.org/Documentation/Acceleration_and_parallelization
and
http://www.gromacs.org/Documentation/Installation_Instructions_4.5/GROMACS-OpenMM)
but it's a bit confusing and I need help to check whether I have really
understood it correctly. :)
I have 2 types of nodes:
3 GPUs (NVIDIA Tesla M2090) and 2 CPUs with 6 cores each (Intel Xeon
E5649 @ 2.53GHz)
8 GPUs and 2 CPUs (6 cores each)

1) I can only have 1 MPI rank per GPU, meaning that with 3 GPUs I can
have 3 MPI ranks at most.
2) Because I have 12 cores, I can open 4 OpenMP threads per MPI rank,
because 4 x 3 = 12.

Now, if I have a node with 8 GPUs, can I use 4 GPUs, i.e. 4 MPI ranks
and 3 OpenMP threads each? Is that right?
Is it possible to use only 8 GPUs and 8 cores?

Using GROMACS 4.6.2 and 144 CPU cores I reach 35 ns/day, while with 3
GPUs and 12 cores I get 9-11 ns/day.

The command that I use is:
mdrun -dlb yes -s input_50.tpr -deffnm 306s_50 -v
with the number of GPUs set via the submission script:
#BSUB -n 3

I also tried setting -npme / -nt / -ntmpi / -ntomp, but nothing changes.

The mdp file and some statistics follow:

 START MDP 

title = G6PD wt molecular dynamics (2bhl.pdb) - NPT MD

; Run parameters
integrator  = md        ; algorithm options
nsteps      = 2500      ; maximum number of steps to perform [50 ns]
dt          = 0.002     ; 2 fs = 0.002 ps

; Output control
nstxout       = 1       ; [steps] freq to write coordinates to trajectory, the last coordinates are always written
nstvout       = 1       ; [steps] freq to write velocities to trajectory, the last velocities are always written
nstlog        = 1       ; [steps] freq to write energies to log file, the last energies are always written
nstenergy     = 1       ; [steps] write energies to disk every nstenergy steps
nstxtcout     = 1       ; [steps] freq to write coordinates to xtc trajectory
xtc_precision = 1000    ; precision to write to xtc trajectory (1000 = default)
xtc_grps      = system  ; which coordinate group(s) to write to disk
energygrps    = system  ; which energy group(s) to write

; Bond parameters
continuation         = yes       ; restarting from NPT
constraints          = all-bonds ; bond types to replace by constraints
constraint_algorithm = lincs     ; holonomic constraints
lincs_iter           = 1         ; accuracy of LINCS
lincs_order          = 4         ; also related to accuracy
lincs_warnangle      = 30        ; [degrees] maximum angle that a bond can rotate before LINCS will complain

; Neighborsearching
ns_type       = grid    ; method of updating neighbor list
cutoff-scheme = Verlet
nstlist       = 10      ; [steps] frequency to update neighbor list (10)
rlist         = 1.0     ; [nm] cut-off distance for the short-range neighbor list (1 default)
rcoulomb      = 1.0     ; [nm] long-range electrostatic cut-off
rvdw          = 1.0     ; [nm] long-range Van der Waals cut-off

; Electrostatics
coulombtype = PME       ; treatment of long-range electrostatic interactions
vdwtype     = cut-off   ; treatment of Van der Waals interactions

; Periodic boundary conditions
pbc = xyz

; Dispersion correction
DispCorr = EnerPres     ; apply long-range dispersion corrections for energy and pressure

; Ewald
fourierspacing = 0.12   ; grid spacing for FFT - controls the highest magnitude of wave vectors (0.12)
pme_order      = 4      ; interpolation order for PME, 4 = cubic
ewald_rtol     = 1e-5   ; relative strength of Ewald-shifted potential at rcoulomb

; Temperature coupling
tcoupl  = nose-hoover           ; temperature coupling with the Nose-Hoover thermostat
tc_grps = Protein Non-Protein
tau_t   = 0.4   0.4             ; [ps] time constant
ref_t   = 310   310             ; [K] reference temperature for coupling [310 K ≈ 37 °C]

; Pressure coupling
pcoupl           = parrinello-rahman
pcoupltype       = isotropic    ; uniform scaling of box vectors
tau_p            = 2.0          ; [ps] time constant
ref_p            = 1.0          ; [bar] reference pressure for coupling
compressibility  = 4.5e-5       ; [bar^-1] isothermal compressibility of water
refcoord_scaling = com          ; see the GROMACS documentation

; Velocity generation
gen_vel = no