Re: mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)

2013-11-12 Thread Szilárd Páll
As Mark said, please share the *entire* log file. Among other
important things, the result of PP-PME tuning is not included above.

However, I suspect that in this case scaling is strongly affected by
the small size of the system you are simulating.
--
Szilárd


On Sun, Nov 10, 2013 at 5:28 AM, Dwey Kauffman mpi...@gmail.com wrote:
 Hi Szilard,

  Thank you very much for your suggestions.

Actually, I was jumping to conclusions too early, as you mentioned AMD
cluster, I assumed you must have 12-16-core Opteron CPUs. If you
have an 8-core (desktop?) AMD CPU, then you may not need to run more
than one rank per GPU.

 Yes, we do have independent clusters with AMD, AMD Opteron and Intel Core i7
 CPUs. All nodes of the three clusters have (at least) 1 GPU card installed.
 I have run the same test on all three clusters.

 Let's focus on a basic scaling issue: one GPU vs. two GPUs within the same
 node with an 8-core AMD CPU.
 Using 1 GPU, we get a performance of ~32 ns/day. Using two GPUs, we gain
 not much more (~38.5 ns/day), i.e. about ~20% more performance. Even that is
 not consistent: in some tests I saw only 2-5% more, which really surprised me.

 As you can see, this test was made on a single node, so networking is not
 involved. Can the performance be improved by, say, 50% when 2 GPUs are used
 for a typical task? If yes, how?

Indeed, as Richard pointed out, I was asking for *full* logs; these
summaries can't tell much. The table above the summary, entitled "R E A
L   C Y C L E   A N D   T I M E   A C C O U N T I N G", as well as the
other information reported across the log file, is what I need to make
an assessment of your simulations' performance.

 Please see below.

However, in your case I suspect that the
bottleneck is multi-threaded scaling on the AMD CPUs and you should
probably decrease the number of threads per MPI rank and share GPUs
between 2-4 ranks.

 After testing all three clusters, I found it may NOT be an issue of AMD CPUs:
 Intel CPUs have the SAME scaling issue.

 However, I am curious how you justify the setup of 2-4 ranks sharing
 GPUs? Can you please explain it a bit more?


You could try running
mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
but I suspect this won't help because your scaling issue

 Your guess is correct, but why is that? It is worse: the more nodes are
 involved in a task, the worse the performance.


 in my
experience even reaction field runs don't scale across nodes with 10G
ethernet if you have more than 4-6 ranks per node trying to
communicate (let alone with PME).

 What does it mean, let alone with PME? How do I do that? Via mdrun?
 I do know mdrun -npme specifies the number of PME processes.

 Thank you.

 Dwey



 ### One GPU 

  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

  Computing:              Nodes  Th.  Count   Wall t (s)      G-Cycles      %
 ------------------------------------------------------------------------------
  Neighbor search             1    8     11       431.817     13863.390    1.6
  Launch GPU ops.             1    8    501       472.906     15182.556    1.7
  Force                       1    8    501      1328.611     42654.785    4.9
  PME mesh                    1    8    501     11561.327    371174.090   42.8
  Wait GPU local              1    8    501      6888.008    221138.111   25.5
  NB X/F buffer ops.          1    8    991      1216.499     39055.455    4.5
  Write traj.                 1    8   1030        12.741       409.039    0.0
  Update                      1    8    501      1696.358     54461.226    6.3
  Constraints                 1    8    501      1969.726     63237.647    7.3
  Rest                        1                  1458.820     46835.133    5.4
 ------------------------------------------------------------------------------
  Total                       1                 27036.812    868011.431  100.0
 ------------------------------------------------------------------------------
  PME spread/gather           1    8   1002      6975.086    223933.739   25.8
  PME 3D-FFT                  1    8   1002      3928.259    126115.976   14.5
  PME solve                   1    8    501       636.488     20434.327    2.4
 ------------------------------------------------------------------------------

  GPU timings
 ------------------------------------------------------------------------------
  Computing:                      Count  Wall t (s)      ms/step      %
 ------------------------------------------------------------------------------
  Pair list H2D                      11      43.435        0.434    0.2
  X / q H2D                         501     567.168        0.113    2.8
  Nonbonded F kernel                400   14174.316        3.544   70.8
  Nonbonded F+ene k.                 90    4314.438        4.794   21.5
  Nonbonded F+ene+prune k.           11     572.370        5.724    2.9
  F D2H

Re: mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)

2013-11-12 Thread Dwey Kauffman
Hi Mark and Szilard

Thanks for your both suggestions. They are very helpful.


 Neither run had a PP-PME work distribution suitable for the hardware it
 was
 running on (and fixing that for each run requires opposite changes).
 Adding
 a GPU and hoping to see scaling requires that there be proportionately
 more
 GPU work available to do, *and* enough absolute work to do. mdrun tries to
 do this, and reports early in the log file, which is one of the reasons
 Szilard asked to see whole log files - please use a file sharing service
 to
 do that.


This task involves GPU calculation, so I had not looked at the PP-PME work
distribution; that is a good hint. I guessed that the two GPUs finish their
calculations quickly, or that there is not enough work for them, which is
in line with your explanation.
 
Please see logs below again.

 ONE GPU##

http://pastebin.com/B6bRUVSa

 TWO GPUs##
http://pastebin.com/SLAYnejP
 

 As you can see, this test was made on a single node, so networking is not
  involved. Can the performance be improved by, say, 50% when 2 GPUs are
  used for a typical task? If yes, how?
 
  Indeed, as Richard pointed out, I was asking for *full* logs; these
  summaries can't tell much. The table above the summary, entitled "R E A
  L   C Y C L E   A N D   T I M E   A C C O U N T I N G", as well as the
  other information reported across the log file, is what I need to make
  an assessment of your simulations' performance.
 
  Please see below.
 
  However, in your case I suspect that the
  bottleneck is multi-threaded scaling on the AMD CPUs and you should
  probably decrease the number of threads per MPI rank and share GPUs
  between 2-4 ranks.
 
  After testing all three clusters, I found it may NOT be an issue of AMD
  CPUs: Intel CPUs have the SAME scaling issue.
 
  However, I am curious how you justify the setup of 2-4 ranks sharing
  GPUs? Can you please explain it a bit more?
 

 NUMA effects on multi-socket AMD processors are particularly severe; the
 way GROMACS uses OpenMP is not well suited to them. Using a rank (or two)
 per socket will greatly reduce those effects, but introduces different
 algorithmic overhead from the need to do DD and explicitly communicate
 between ranks. (You can see the latter in your .log file snippets below.)
 Also, that means the parcel of PP work available from a rank to give to
 the
 GPU is smaller, which is the opposite of what you'd like for GPU
 performance and/or scaling. We are working on a general solution for this
 and lots of related issues in the post-5.0 space, but there is a very hard
 limitation imposed by the need to amortize the cost of CPU-GPU transfer by
 having lots of PP work available to do.


Is this the reason why scaling to two GPUs doesn't happen, i.e. because of the
smaller PP workload per rank?
Given that implication, I am wondering whether we can increase the PP workload
through parameters in the .mdp file. The question is which parameters are most
related to the PP workload. Would you please give more specific suggestions?



  NOTE: The GPU has 20% more load than the CPU. This imbalance causes
performance loss, consider using a shorter cut-off and a finer PME
  grid.
 

 This note needs to be addressed before maximum throughput is achieved and
 the question of scaling is worth considering. Ideally, Wait GPU local
 should be nearly zero, achieved as suggested above. Note that
 launch+force+mesh+wait is the sum of gpu total! But much of the
 information
 needed is higher up the log file, and the whole question is constrained by
 things like rvdw.


The note clearly suggests a shorter cut-off and a finer PME grid. I am not
sure how to set up a finer PME grid, but I am able to set up shorter
cut-offs. However, according to others' reports it is risky to do so.
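
(For reference only, a minimal .mdp sketch of the kind of change that note is
suggesting; the values are placeholders, not a recommendation. With the Verlet
scheme rcoulomb and rvdw are kept equal, and a smaller fourier-spacing means a
finer PME grid:)

cutoff-scheme    = Verlet
rcoulomb         = 0.9    ; shorter cut-off -> less short-range (GPU) work
rvdw             = 0.9    ; kept equal to rcoulomb with the Verlet scheme
fourier-spacing  = 0.11   ; smaller spacing -> finer PME grid -> more CPU PME work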
 
Indeed, I see differences among tests with 1 GPU.
Here "cut-offs" refers to rlist, rvdw and rcoulomb.

I found that the smaller the cut-off values, the faster the computation.
The question is how small they can go; it would be interesting to know whether
these different cut-offs produce equally good simulations.

As for two GPUs, when I set larger cut-offs, the two GPUs in the same node
were kept very busy. However, the outcome of that configuration is worse in
terms of ns/day and wall time.

So what does a finer PME grid mean with respect to the GPU workload?

You mention that the GPU total is the sum of launch + force + mesh + wait. I
thought the PME mesh is computed by the CPU, not the GPU. Am I missing
something here?
My understanding is that the GPU is responsible for the short-ranged
non-bonded forces, whereas the CPU handles the bonded and the long-ranged PME
forces. Can you clarify this?

Also, does rvdw play an important role in improving the performance of the
GPU calculation?


 
 Unfortunately you didn't copy the GPU timing stuff here! Roughly, all the
 performance gain you are seeing here is eliminating most of the single-GPU
 wait gpu term by throwing 

Re: mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)

2013-11-10 Thread Mark Abraham
On Sun, Nov 10, 2013 at 5:28 AM, Dwey Kauffman mpi...@gmail.com wrote:

 Hi Szilard,

  Thank you very much for your suggestions.

 Actually, I was jumping to conclusions too early, as you mentioned AMD
 cluster, I assumed you must have 12-16-core Opteron CPUs. If you
 have an 8-core (desktop?) AMD CPU, then you may not need to run more
 than one rank per GPU.

 Yes, we do have independent clusters with AMD, AMD Opteron and Intel Core i7
 CPUs. All nodes of the three clusters have (at least) 1 GPU card installed.
 I have run the same test on all three clusters.

 Let's focus on a basic scaling issue: one GPU vs. two GPUs within the same
 node with an 8-core AMD CPU.
 Using 1 GPU, we get a performance of ~32 ns/day. Using two GPUs, we gain
 not much more (~38.5 ns/day), i.e. about ~20% more performance. Even that is
 not consistent: in some tests I saw only 2-5% more, which really surprised
 me.


Neither run had a PP-PME work distribution suitable for the hardware it was
running on (and fixing that for each run requires opposite changes). Adding
a GPU and hoping to see scaling requires that there be proportionately more
GPU work available to do, *and* enough absolute work to do. mdrun tries to
do this, and reports early in the log file, which is one of the reasons
Szilard asked to see whole log files - please use a file sharing service to
do that.
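
(A hedged aside: the PP-PME balancing that mdrun reports is controlled by its
-tunepme option, which is on by default; it can be switched off to keep the
initial cut-off and grid fixed, which makes runs easier to compare. The
-deffnm names below are just placeholders:)

mdrun -deffnm run_tuned               # -tunepme is on by default
mdrun -notunepme -deffnm run_fixed    # keep the initial cut-off/grid fixed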

As you can see, this test was made on a single node, so networking is not
 involved. Can the performance be improved by, say, 50% when 2 GPUs are
 used for a typical task? If yes, how?

 Indeed, as Richard pointed out, I was asking for *full* logs; these
 summaries can't tell much. The table above the summary, entitled "R E A
 L   C Y C L E   A N D   T I M E   A C C O U N T I N G", as well as the
 other information reported across the log file, is what I need to make
 an assessment of your simulations' performance.

 Please see below.

 However, in your case I suspect that the
 bottleneck is multi-threaded scaling on the AMD CPUs and you should
 probably decrease the number of threads per MPI rank and share GPUs
 between 2-4 ranks.

 After testing all three clusters, I found it may NOT be an issue of AMD
 CPUs: Intel CPUs have the SAME scaling issue.

 However, I am curious how you justify the setup of 2-4 ranks sharing
 GPUs? Can you please explain it a bit more?


NUMA effects on multi-socket AMD processors are particularly severe; the
way GROMACS uses OpenMP is not well suited to them. Using a rank (or two)
per socket will greatly reduce those effects, but introduces different
algorithmic overhead from the need to do DD and explicitly communicate
between ranks. (You can see the latter in your .log file snippets below.)
Also, that means the parcel of PP work available from a rank to give to the
GPU is smaller, which is the opposite of what you'd like for GPU
performance and/or scaling. We are working on a general solution for this
and lots of related issues in the post-5.0 space, but there is a very hard
limitation imposed by the need to amortize the cost of CPU-GPU transfer by
having lots of PP work available to do.
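
(As a sketch only of the "rank per socket" idea, for a hypothetical dual-socket
node with 8 cores per socket and two GPUs - not the 8-core desktop case
discussed here; -pin on keeps each rank's threads on its own socket:)

mpirun -np 2 mdrun_mpi -ntomp 8 -gpu_id 01 -pin on   # one rank per socket, one GPU per rank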

You could try running
 mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
 but I suspect this won't help because your scaling issue

 Your guess is correct, but why is that? It is worse: the more nodes are
 involved in a task, the worse the performance.


  in my
 experience even reaction field runs don't scale across nodes with 10G
 ethernet if you have more than 4-6 ranks per node trying to
 communicate (let alone with PME).

 What does it mean, let alone with PME? How do I do that? Via mdrun?
 I do know mdrun -npme specifies the number of PME processes.


If using PME (rather than RF), network demands are more severe.
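
(One common way to limit that PME communication is to dedicate a few ranks to
PME with -npme; a sketch only, assuming an MPI-enabled build named mdrun_mpi
and purely illustrative rank counts:)

mpirun -np 16 mdrun_mpi -npme 4 -ntomp 2   # 12 PP ranks plus 4 dedicated PME ranks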


 Thank you.

 Dwey



 ### One GPU 

  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

  Computing:              Nodes  Th.  Count   Wall t (s)      G-Cycles      %
 ------------------------------------------------------------------------------
  Neighbor search             1    8     11       431.817     13863.390    1.6
  Launch GPU ops.             1    8    501       472.906     15182.556    1.7
  Force                       1    8    501      1328.611     42654.785    4.9
  PME mesh                    1    8    501     11561.327    371174.090   42.8
  Wait GPU local              1    8    501      6888.008    221138.111   25.5
  NB X/F buffer ops.          1    8    991      1216.499     39055.455    4.5
  Write traj.                 1    8   1030        12.741       409.039    0.0
  Update                      1    8    501      1696.358     54461.226    6.3
  Constraints                 1    8    501      1969.726     63237.647    7.3
  Rest                        1                  1458.820     46835.133    5.4
 ------------------------------------------------------------------------------
  Total                       1                 27036.812    868011.431  100.0
 ------------------------------------------------------------------------------

 

Re: mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)

2013-11-09 Thread Dwey Kauffman
Hi Szilard,

 Thank you very much for your suggestions.

Actually, I was jumping to conclusions too early, as you mentioned AMD
cluster, I assumed you must have 12-16-core Opteron CPUs. If you
have an 8-core (desktop?) AMD CPU, then you may not need to run more
than one rank per GPU.

Yes, we do have independent clusters with AMD, AMD Opteron and Intel Core i7
CPUs. All nodes of the three clusters have (at least) 1 GPU card installed.
I have run the same test on all three clusters.

Let's focus on a basic scaling issue: one GPU vs. two GPUs within the same
node with an 8-core AMD CPU.
Using 1 GPU, we get a performance of ~32 ns/day. Using two GPUs, we gain
not much more (~38.5 ns/day), i.e. about ~20% more performance. Even that is
not consistent: in some tests I saw only 2-5% more, which really surprised me.

As you can see, this test was made on a single node, so networking is not
involved. Can the performance be improved by, say, 50% when 2 GPUs are used
for a typical task? If yes, how?

Indeed, as Richard pointed out, I was asking for *full* logs; these
summaries can't tell much. The table above the summary, entitled "R E A
L   C Y C L E   A N D   T I M E   A C C O U N T I N G", as well as the
other information reported across the log file, is what I need to make
an assessment of your simulations' performance.

Please see below.

However, in your case I suspect that the
bottleneck is multi-threaded scaling on the AMD CPUs and you should
probably decrease the number of threads per MPI rank and share GPUs
between 2-4 ranks.

After testing all three clusters, I found it may NOT be an issue of AMD CPUs:
Intel CPUs have the SAME scaling issue.

However, I am curious how you justify the setup of 2-4 ranks sharing
GPUs? Can you please explain it a bit more?


You could try running
mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
but I suspect this won't help because your scaling issue

Your guess is correct, but why is that? It is worse: the more nodes are
involved in a task, the worse the performance.


 in my
experience even reaction field runs don't scale across nodes with 10G
ethernet if you have more than 4-6 ranks per node trying to
communicate (let alone with PME). 

What does it mean, let alone with PME? How do I do that? Via mdrun?
I do know mdrun -npme specifies the number of PME processes.

Thank you.

Dwey



### One GPU 

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:              Nodes  Th.  Count   Wall t (s)      G-Cycles      %
-------------------------------------------------------------------------------
 Neighbor search             1    8     11       431.817     13863.390    1.6
 Launch GPU ops.             1    8    501       472.906     15182.556    1.7
 Force                       1    8    501      1328.611     42654.785    4.9
 PME mesh                    1    8    501     11561.327    371174.090   42.8
 Wait GPU local              1    8    501      6888.008    221138.111   25.5
 NB X/F buffer ops.          1    8    991      1216.499     39055.455    4.5
 Write traj.                 1    8   1030        12.741       409.039    0.0
 Update                      1    8    501      1696.358     54461.226    6.3
 Constraints                 1    8    501      1969.726     63237.647    7.3
 Rest                        1                  1458.820     46835.133    5.4
-------------------------------------------------------------------------------
 Total                       1                 27036.812    868011.431  100.0
-------------------------------------------------------------------------------
 PME spread/gather           1    8   1002      6975.086    223933.739   25.8
 PME 3D-FFT                  1    8   1002      3928.259    126115.976   14.5
 PME solve                   1    8    501       636.488     20434.327    2.4
-------------------------------------------------------------------------------

 GPU timings
-------------------------------------------------------------------------------
 Computing:                      Count  Wall t (s)      ms/step      %
-------------------------------------------------------------------------------
 Pair list H2D                      11      43.435        0.434    0.2
 X / q H2D                         501     567.168        0.113    2.8
 Nonbonded F kernel                400   14174.316        3.544   70.8
 Nonbonded F+ene k.                 90    4314.438        4.794   21.5
 Nonbonded F+ene+prune k.           11     572.370        5.724    2.9
 F D2H                             501     358.120        0.072    1.8
-------------------------------------------------------------------------------
 Total                                   20029.846        4.006  100.0
-------------------------------------------------------------------------------

Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
For optimal performance this ratio should be close to 1!



mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)

2013-11-07 Thread Szilárd Páll
Let's not hijack James' thread as your hardware is different from his.

On Tue, Nov 5, 2013 at 11:00 PM, Dwey Kauffman mpi...@gmail.com wrote:
 Hi Szilard,

 Thanks for your suggestions. I am indeed aware of this page. On an 8-core
  AMD with 1 GPU, I am very happy with its performance. See below. My

Actually, I was jumping to conclusions too early, as you mentioned AMD
cluster, I assumed you must have 12-16-core Opteron CPUs. If you
have an 8-core (desktop?) AMD CPU, then you may not need to run more
than one rank per GPU.

 intention is to obtain an even better one because we have multiple nodes.

Btw, I'm not sure it's an economically viable solution to install an
Infiniband network - especially if you have desktop-class machines.
Such a network will end up costing $500 per machine just for a single
network card, let alone cabling and switches.


 ### 8-core AMD with 1 GPU 
 Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
 For optimal performance this ratio should be close to 1!


 NOTE: The GPU has 20% more load than the CPU. This imbalance causes
   performance loss, consider using a shorter cut-off and a finer PME
 grid.

                Core t (s)   Wall t (s)      (%)
        Time:   216205.510    27036.812    799.7
                       7h30:36
                  (ns/day)    (hour/ns)
 Performance:       31.956        0.751

 ### 8-core AMD with 2 GPUs 

                Core t (s)   Wall t (s)      (%)
        Time:   178961.450    22398.880    799.0
                       6h13:18
                  (ns/day)    (hour/ns)
 Performance:       38.573        0.622
 Finished mdrun on node 0 Sat Jul 13 09:24:39 2013


Indeed, as Richard pointed out, I was asking for *full* logs; these
summaries can't tell much. The table above the summary, entitled "R E A
L   C Y C L E   A N D   T I M E   A C C O U N T I N G", as well as the
other information reported across the log file, is what I need to make
an assessment of your simulations' performance.

However, in your case I suspect that the
bottleneck is multi-threaded scaling on the AMD CPUs and you should
probably decrease the number of threads per MPI rank and share GPUs
between 2-4 ranks.


 OK, but can you give an example of the mdrun command for an 8-core AMD with
 2 GPUs? I will try to run it again.

You could try running
mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
but I suspect this won't help because your scaling issue
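
(As a sketch of the same mapping with the built-in thread-MPI mdrun, i.e.
without mpirun; -gpu_id lists one GPU id per PP rank on the node, and the
thread counts below are only illustrative:)

mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011   # 4 ranks x 2 threads, two ranks per GPU
mdrun -ntmpi 2 -ntomp 4 -gpu_id 01     # 2 ranks x 4 threads, one rank per GPU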



Regarding scaling across nodes, you can't expect much from gigabit
ethernet - especially not from the cheaper cards/switches. In my
experience even reaction-field runs don't scale across nodes with 10G
ethernet if you have more than 4-6 ranks per node trying to
communicate (let alone with PME). However, on Infiniband clusters we
have seen scaling to 100 atoms/core (at peak).

 From your comments, it sounds like a cluster of AMD CPUs is difficult to
 scale across nodes in our current setup.

 Let's assume we install Infiniband (20 or 40 Gb/s) in the same system of 16
 nodes of 8-core AMD with 1 GPU each. Considering the same AMD system, what
 is a good way to obtain better performance when we run a task across nodes?
 In other words, what does the mdrun_mpi command look like?
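
 (As a rough sketch only - the exact launch line depends on the MPI stack and
 the queueing system. Assuming OpenMPI, 16 nodes, 8 cores and 1 GPU per node,
 something along these lines would be a starting point:)

 mpirun -np 16 -npernode 1 mdrun_mpi -ntomp 8 -gpu_id 0    # one rank per node
 mpirun -np 32 -npernode 2 mdrun_mpi -ntomp 4 -gpu_id 00   # two ranks per node sharing GPU 0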

 Thanks,
 Dwey



