
Make a scaling test, and run on a single node only at first. That gives you
a baseline: on N nodes you can expect at most N times that performance.
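
To make such a comparison concrete, here is a minimal Python sketch. The
function name is my own, and the only numbers used are the best ns/day values
from the two g_tune_pme summaries quoted below. Since no single-node number
is available yet, it compares the 160-process run against the 48-process run.

```python
def parallel_efficiency(base_procs, base_ns_day, procs, ns_day):
    """Measured speedup divided by ideal (linear) speedup, relative to a baseline run."""
    speedup = ns_day / base_ns_day   # how much faster the larger run actually is
    ideal = procs / base_procs       # how much faster it would be with perfect scaling
    return speedup / ideal


# Best results quoted in this thread: 1.110 ns/day on 48 processes,
# 1.038 ns/day on 160 processes.
eff = parallel_efficiency(48, 1.110, 160, 1.038)
print(f"160 vs. 48 processes: {eff:.0%} parallel efficiency")
```

This comes out at roughly 28%, i.e. the extra processes are mostly wasted.
Once you have a single-node number, use that as the baseline instead.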

On a single node, you can also run with Gromacs' built-in thread-MPI, which
rules out problems with your MPI installation.

There are lots of reasons why your parallel performance could be bad.
Can you check that the Infiniband interconnect is actually used, and
not the Ethernet? It could also be that a stray process is still
running on some of your cores and eating up CPU time. Or maybe the
pinning of threads to cores is not correct (what does md.log say
about that?).
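
If md.log is long, a small script can pull out the relevant lines. This is
only a sketch: the function names are hypothetical, and the keywords "pin"
and "affinity" are my assumption about what to search for, since the exact
wording of the messages differs between Gromacs versions.

```python
import re

# Assumed keywords; adjust them to whatever your Gromacs version prints.
PINNING_PATTERN = re.compile(r"\bpin|affinity", re.IGNORECASE)


def find_pinning_lines(log_text):
    """Return the lines of a log that mention pinning or thread affinity."""
    return [line.strip() for line in log_text.splitlines()
            if PINNING_PATTERN.search(line)]


def scan_file(path):
    """Read a log file and return its pinning/affinity related lines."""
    with open(path, errors="replace") as handle:
        return find_pinning_lines(handle.read())
```

Calling scan_file("md.log") in the run directory returns the matching lines.
An empty list does not prove that pinning is fine; it only means that none of
the assumed keywords occurred.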

Just a few ideas.

Good luck!


> Hi
> I have been trying to run a simulation on a cluster consisting of 24 nodes
> with Intel(R) Xeon(R) CPU X5670 @ 2.93GHz. Each node has 12 processors, and
> the nodes are connected via 1Gbit Ethernet and an Infiniband interconnect.
> The batch system is TORQUE. However, due to some issues with the parallel
> queue, I have been running the simulations directly on the cluster using
> mpdboot and mpirun.
> Following is the mdp.out file that I am using for the simulation.
> The system has 250853 atoms. I used g_tune_pme to check the
> performance with different numbers of processors.
> Following are the perf.out files for 48 and 160 processors, respectively:
> Summary of successful runs:
> Line tpr PME nodes  Gcycles Av.     Std.dev.       ns/day        PME/f    DD grid
>   0   0    8           181.713        7.698        0.952        1.334    8   5   1
>   1   0    6           156.720        4.086        1.104        1.420    6   7   1
>   2   0    4           196.320       16.161        0.885        0.916    4  11   1
>   3   0    3           195.312        1.127        0.886        0.840    3   5   3
>   4   0    0           370.539       12.942        0.468          -      8   6   1
>   5   0   -1(  8)      185.688        0.839        0.932        1.322    8   5   1
>   6   1    8           185.651       14.798        0.934        1.294    8   5   1
>   7   1    6           155.970        3.320        1.110        1.157    6   7   1
>   8   1    4           177.021       15.459        0.980        1.005    4  11   1
>   9   1    3           190.704       22.673        0.914        0.931    3   5   3
>  10   1    0           293.676        5.460        0.589          -      8   6   1
>  11   1   -1(  8)      188.978        3.686        0.915        1.266    8   5   1
>  12   2    8           210.631       17.457        0.824        1.176    8   5   1
>  13   2    6           171.926       10.462        1.008        1.186    6   7   1
>  14   2    4           200.015        6.696        0.865        0.839    4  11   1
>  15   2    3           215.013        5.881        0.804        0.863    3   5   3
>  16   2    0           298.363        7.187        0.580          -      8   6   1
>  17   2   -1(  8)      208.821       34.409        0.840        1.088    8   5   1
> ------------------------------------------------------------
> Best performance was achieved with 6 PME nodes (see line 7)
> Optimized PME settings:
>   New Coulomb radius: 1.100000 nm (was 1.000000 nm)
>   New Van der Waals radius: 1.100000 nm (was 1.000000 nm)
>   New Fourier grid xyz: 80 80 80 (was 96 96 96)
> Please use this command line to launch the simulation:
> mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on
> Summary of successful runs:
> Line tpr PME nodes  Gcycles Av.     Std.dev.       ns/day        PME/f    DD grid
>   0   0   25           283.628        2.191        0.610        1.749    5   9   3
>   1   0   20           240.888        9.132        0.719        1.618    5   4   7
>   2   0   16           166.570        0.394        1.038        1.239    8   6   3
>   3   0    0           435.389        3.399        0.397          -     10   8   2
>   4   0   -1( 20)      237.623        6.298        0.729        1.406    5   4   7
>   5   1   25           286.990        1.662        0.603        1.813    5   9   3
>   6   1   20           235.818        0.754        0.734        1.495    5   4   7
>   7   1   16           167.888        3.028        1.030        1.256    8   6   3
>   8   1    0           284.264        3.775        0.609          -      8   5   4
>   9   1   -1( 16)      167.858        1.924        1.030        1.303    8   6   3
>  10   2   25           298.637        1.660        0.579        1.696    5   9   3
>  11   2   20           281.647        1.074        0.614        1.296    5   4   7
>  12   2   16           184.012        4.022        0.941        1.244    8   6   3
>  13   2    0           304.658        0.793        0.568          -      8   5   4
>  14   2   -1( 16)      183.084        2.203        0.945        1.188    8   6   3
> ------------------------------------------------------------
> Best performance was achieved with 16 PME nodes (see line 2)
> and original PME settings.
> Please use this command line to launch the simulation:
> mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on
> Both of these results (1.110 ns/day and 1.038 ns/day) are lower than what I
> get on my workstation with a Xeon W3550 3.07 GHz using 8 threads
> (1.431 ns/day) for a similar system.
> The bench.log file generated by g_tune_pme shows very high load imbalance
> (>60%, up to 100%). I have tried several combinations of np and npme, but
> the performance is always in this range.
> Can someone please tell me what I am doing wrong, or how I can
> decrease the simulation time?
