Hi Dwey,

On 05/11/13 22:00, Dwey Kauffman wrote:
Hi Szilard,

    Thanks for your suggestions. I am indeed aware of this page. On an 8-core
AMD with 1 GPU, I am very happy with its performance. See below. My
intention is to obtain an even better one because we have multiple nodes.

### 8-core AMD with 1 GPU
Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
For optimal performance this ratio should be close to 1!


NOTE: The GPU has >20% more load than the CPU. This imbalance causes
      performance loss, consider using a shorter cut-off and a finer PME grid.

                Core t (s)   Wall t (s)        (%)
        Time:   216205.510    27036.812      799.7
                          7h30:36
                  (ns/day)    (hour/ns)
Performance:       31.956        0.751
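
(Side note on the NOTE above: the kind of .mdp change it hints at would look
roughly like the lines below. The numbers are purely illustrative, not tuned
for this system, and mdrun's automatic PME tuning may already be doing
something similar for you.)

; shift load from the GPU (short-range nonbonded) to the CPU (PME)
rcoulomb       = 0.9    ; shorter real-space cut-off than e.g. the usual 1.0
rvdw           = 0.9    ; with the Verlet scheme, keep rvdw equal to rcoulomb
fourierspacing = 0.11   ; finer PME grid to preserve electrostatics accuracy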

### 8-core AMD with 2 GPUs

                Core t (s)   Wall t (s)        (%)
        Time:   178961.450    22398.880      799.0
                          6h13:18
                  (ns/day)    (hour/ns)
Performance:       38.573        0.622
Finished mdrun on node 0 Sat Jul 13 09:24:39 2013


I'm almost certain that Szilard meant the lines above this, which give the breakdown of where the time is spent in the simulation.

Richard

However, in your case I suspect that the
bottleneck is multi-threaded scaling on the AMD CPUs and you should
probably decrease the number of threads per MPI rank and share GPUs
between 2-4 ranks.


OK, but can you give an example of an mdrun command for an 8-core AMD with 2
GPUs? I will try to run it again.
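
(A minimal sketch, not an authoritative answer: with GROMACS 4.6's built-in
thread-MPI, Szilard's suggestion of sharing GPUs between ranks could look
roughly like the lines below. The rank/thread split and the -deffnm name are
placeholders that need benchmarking on your machine.)

# 4 thread-MPI ranks x 2 OpenMP threads = 8 cores;
# ranks 0,1 share GPU 0 and ranks 2,3 share GPU 1
mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011 -deffnm topol

# alternative: 2 ranks x 4 threads, one GPU per rank
mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -deffnm topol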


Regarding scaling across nodes, you can't expect much from gigabit
ethernet, especially not from the cheaper cards/switches. In my
experience, even reaction-field runs don't scale across nodes with 10G
ethernet if you have more than 4-6 ranks per node trying to
communicate (let alone with PME). However, on InfiniBand clusters we
have seen scaling to 100 atoms/core (at peak).

From your comments, it sounds like a cluster of AMD CPUs is difficult to
scale across nodes in our current setup.

Let's assume we install InfiniBand (20 or 40 Gb/s) in the same system of 16
nodes of 8-core AMD with 1 GPU each. Considering the same AMD system, what
is a good way to obtain better performance when we run a task across nodes?
In other words, what does mdrun_mpi look like?
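
(Again only a sketch, assuming an MPI-enabled build and 16 nodes x 8 cores x
1 GPU over InfiniBand; the per-node rank count, thread count and any separate
PME ranks (-npme) would need benchmarking, e.g. with g_tune_pme.)

# 2 PP ranks per node x 4 OpenMP threads = 8 cores per node, with both
# ranks on a node sharing its single GPU (id 0); assumes mpirun places
# 2 ranks per node (16 nodes x 2 = 32 ranks)
mpirun -np 32 mdrun_mpi -ntomp 4 -gpu_id 00 -dlb yes -deffnm topol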

Thanks,
Dwey




