On 13.11.2012 06:16, gmx-users-requ...@gromacs.org wrote:
Dear all,
>I did some scaling tests for a cluster and I'm a little bit clueless about the
>results.
>So first the setup:
>
>Cluster:
>Saxonid 6100, Opteron 6272 16C 2.100GHz, Infiniband QDR
>GROMACS version: 4.0.7 and 4.5.5
>Compiler:   GCC 4.7.0
>MPI: Intel MPI 4.0.3.008
>FFT-library: ACML 5.1.0 fma4
>
>System:
>895 spce water molecules
This is a somewhat small system, I would say.

>Simulation time: 750 ps (0.002 ps timestep)
>Cut-off: 1.0 nm
>but with long-range corrections (DispCorr = EnerPres; PME with standard
>settings - but in each case no extra CPU dedicated solely to PME)
>V-rescale thermostat and Parrinello-Rahman barostat
>
>I get the following timings (in seconds), where each value is normalized to the
>time that would be needed on 1 CPU (so if a job on 2 CPUs took X s, the reported
>time would be 2 * X s).
>These timings were taken from the *.log file, at the end of the
>'Real cycle and time accounting' section.
>
>Timings:
>gmx-version 1cpu    2cpu    4cpu
>4.0.7               4223    3384    3540
>4.5.5               3780    3255    2878
Do you mean CPUs or CPU cores? Are you using the IB network or are you running 
single-node?

I meant the number of cores, and all cores are on the same node.
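
Just to make the comparison easier for myself, here is a minimal sketch (plain
Python, nothing GROMACS-specific) of how the normalized values translate back
into wall-clock time, speedup and parallel efficiency; the numbers below are
the 4.5.5 totals from the table above:

normalized = {1: 3780.0, 2: 3255.0, 4: 2878.0}  # cores -> cores * wall-clock seconds

t1 = normalized[1]  # on 1 core the normalized time equals the wall-clock time
for cores, t_norm in sorted(normalized.items()):
    wall = t_norm / cores              # actual wall-clock time of the run
    speedup = t1 / wall                # how much faster than the 1-core run
    efficiency = speedup / cores       # 1.0 = ideal scaling, > 1.0 = superlinear
    print(f"{cores} cores: wall = {wall:7.1f} s, "
          f"speedup = {speedup:4.2f}, efficiency = {efficiency:4.2f}")

With these numbers the efficiency comes out above 1.0 for the 2- and 4-core
runs, i.e. each core gets through the work faster in parallel than it does
alone, which is exactly the part that confuses me.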


>
>I'm a little bit clueless about the results. I always thought that if I have
>a non-interacting system and double the number of CPUs, I
You do use PME, which means a global interaction of all charges.

>would get a simulation which takes only half the time (so the times as defined
>above would be equal). If the system does have interactions, I would lose some
>performance due to communication, and load imbalance between the nodes could
>cause a further loss of performance.
>
>Keeping this in mind, I can only explain the 4.0.7 timings going from 2 to 4
>cores (2 cores is a little bit faster, since going to 4 cores leads to more
>communication and therefore a loss of performance).
>
>All the other timings I do not understand, especially that the 1-core run takes
>longer (in normalized time) than the parallel runs in every case.
>Probably the system is too small and/or the simulation time is too short for a
>scaling test. But I would assume that the time to set up the simulation would be
>the same for all three cases of one GROMACS version.
>The only other explanation that comes to my mind would be that something went
>wrong during the installation of the programs?
You might want to take a closer look at the timings in the md.log output files;
this will give you a clue where the bottleneck is, and also tell you about the
communication-to-computation ratio.

Best,
   Carsten


>
>Please, can somebody enlighten me?
>

Here are the timings from the log file (all cores on the same node). The last
column marks whether the value goes up or down with more cores; 'n/a' means
that part only exists in parallel runs:

 Computing:                    1 core   2 cores  4 cores   trend
 ----------------------------------------------------------------
 Domain decomp.                  n/a      41.7     47.8     up
 DD comm. load                   n/a       0.0      0.0     -
 Comm. coord.                    n/a      17.8     30.5     up
 Neighbor search               614.1     355.4    323.7     down
 Force                        2401.6    1968.7   1676.0     down
 Wait + Comm. F                  n/a      15.1     31.4     up
 PME mesh                      596.3     710.4    639.1     -
 Write traj.                     1.2       0.8      0.6     down
 Update                         49.7      44.0     37.6     down
 Constraints                    79.3      70.4     60.0     down
 Comm. energies                  n/a       3.2      5.3     up
 Rest                           38.3      27.1     25.4     down
 ----------------------------------------------------------------
 Total                        3780.5    3254.6   2877.5     down
 ----------------------------------------------------------------
 PME redist. X/F                 n/a     133.0    120.5     down
 PME spread/gather             511.3     465.7    396.8     down
 PME 3D-FFT                     59.4      88.9    102.2     up
 PME solve                      25.2      22.2     18.9     down
 ----------------------------------------------------------------

The two computation parts for which the most time is saved by going parallel are:
1) Force
2) Neighbor search (OK, going from 2 cores to 4 cores does not make a big difference, but going from 1 core to 2 or 4 saves a lot of time)
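
To put some numbers on this, here is a small back-of-the-envelope script (plain
Python, values copied from the table above for 4.5.5; grouping some parts as
"communication-related" is my own rough choice, not a GROMACS classification):

# Normalized timings (seconds); None marks parts that do not exist in a serial run.
parts = {
    "Neighbor search": (614.1, 355.4, 323.7),
    "Force":           (2401.6, 1968.7, 1676.0),
    "PME mesh":        (596.3, 710.4, 639.1),
    "Domain decomp.":  (None, 41.7, 47.8),
    "Comm. coord.":    (None, 17.8, 30.5),
    "Wait + Comm. F":  (None, 15.1, 31.4),
    "Comm. energies":  (None, 3.2, 5.3),
}
total_4cores = 2877.5

# How each part's normalized cost on 4 cores compares to the 1-core run:
# a ratio below 1.0 means that part became cheaper even in core-seconds.
for name, (t1, t2, t4) in parts.items():
    if t1 is not None:
        print(f"{name:16s}: 4-core cost is {t4 / t1:.2f} x the 1-core cost")

# Rough share of communication/decomposition overhead on 4 cores.
comm4 = sum(t4 for t1, t2, t4 in parts.values() if t1 is None)
print(f"comm-related parts on 4 cores: {comm4:.1f} s "
      f"({100 * comm4 / total_4cores:.1f} % of the total)")

So on these numbers the communication-related parts are only a few percent of
the total; the dominant effect is that Force and Neighbor search themselves
become cheaper (even in normalized core-seconds) when the work is split up.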

Is there any good explanation for this time saving?
I would have thought that the system has a fixed number of interactions and one has to calculate all of them. If I divide the set into 2 or 4 smaller sets, the total number of interactions shouldn't change, and so the calculation time shouldn't change either?

Or is there something fancy in the algorithm which reduces the time spent accessing the arrays when the calculation works on a smaller set of interactions?
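
One explanation I could imagine (just an assumption on my part, I have not
verified it for this run) is a cache effect: with domain decomposition each core
only touches its own, much smaller set of particles, so the working set fits
better into the CPU caches and the same total number of interactions is computed
with fewer slow main-memory accesses. A toy micro-benchmark in that spirit
(plain Python/NumPy, nothing GROMACS-specific; how big the effect is depends
entirely on the hardware):

import time
import numpy as np

# Same total amount of work in both cases: sum 400 million doubles.
# Case A: one large array that is much bigger than the CPU caches.
# Case B: the same number of additions done on one small, cache-sized array.
big = np.random.rand(40_000_000)     # ~320 MB, lives in main memory
small = np.random.rand(500_000)      # ~4 MB, roughly cache-sized

t0 = time.perf_counter()
for _ in range(10):                  # 10 * 40e6 = 400 million elements
    big.sum()
t_big = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(10 * 80):             # 800 * 0.5e6 = 400 million elements
    small.sum()
t_small = time.perf_counter() - t0

print(f"large working set: {t_big:.2f} s")
print(f"small working set: {t_small:.2f} s (same number of additions)")

If the second loop comes out noticeably faster on your machine even though it
does exactly the same number of additions, that is the kind of behaviour that
would make Force and Neighbor search cheaper per core once the per-core working
set shrinks.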