Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-26 Thread Szilárd Páll
On Tue, Aug 26, 2014 at 7:51 AM, Theodore Si sjyz...@gmail.com wrote:
 Hi  Szilárd,

 But CUDA 5.5 won't work with icc 14, right?

Sure, but I don't see how gcc 4.4 + CUDA 5.0 is superior to [a recent
compiler that nvcc supports] + CUDA 5.5.

Additionally, as I said before, gcc 4.8 will almost certainly outperform icc.
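
For what it's worth, a build configured along those lines could look
roughly like this; treat it as a sketch only, since the compiler binary
names and the CUDA install path (here /usr/local/cuda-5.5) will differ on
your machines:

  CC=gcc-4.8 CXX=g++-4.8 cmake .. \
      -DGMX_GPU=ON \
      -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-5.5 \
      -DGMX_FFT_LIBRARY=fftw3 \
      -DGMX_BUILD_OWN_FFTW=ON
  make -j 16

GMX_BUILD_OWN_FFTW lets the build download and compile FFTW itself with
the recommended (non-AVX) SIMD settings, so you don't have to configure
FFTW by hand.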

Cheers,
--
Szilárd

 It only works with icc 12.1 unless a CUDA 5.5 header is modified.

 Theo


 On 8/25/2014 9:44 PM, Szilárd Páll wrote:

 On Mon, Aug 25, 2014 at 8:08 AM, Mark Abraham mark.j.abra...@gmail.com
 wrote:

 On Mon, Aug 25, 2014 at 5:01 AM, Theodore Si sjyz...@gmail.com wrote:

 Hi,

 https://onedrive.live.com/redir?resid=990FCE59E48164A4!
 2572authkey=!AP82sTNxS6MHgUkithint=file%2clog
 https://onedrive.live.com/redir?resid=990FCE59E48164A4!
 2482authkey=!APLkizOBzXtPHxsithint=file%2clog

 These are 2 log files. The first one uses 64 CPU cores (64 / 16 = 4
 nodes) and 4 nodes * 2 = 8 GPUs, and the second uses 512 CPU cores and no
 GPUs. When we look at the 64-core log file, we find that in the
 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G  table, the
 total wall time is the sum of every line, that is
 37.730 = 2.201 + 0.082 + ... + 1.150. So we think that when the CPUs are
 doing PME, the GPUs are doing nothing. That's why we say they are working
 sequentially.

 Please note that sequential means one phase after another. Your log
 files don't show the timing breakdown for the GPUs, which is distinct
 from
 showing that the GPUs ran and then the CPUs ran (which I don't think the
 code even permits!). References to CUDA 8x8 kernels do show the GPU was
 active. There was an issue with mdrun not always being able to gather and
 publish the GPU timing results; I don't recall the conditions (Szilard
 might remember), but it might be fixed in a later release.

 It is a limitation (well, I'd say borderline bug) in CUDA that if you
 have multiple work-queues (= streams), reliable timing using the CUDA
 built-in mechanisms is impossible. There may be a way to work around
 this, but that won't happen in the current versions. What's important
 is to observe the wait time on the CPU side, and of course, if the OP is
 profiling, this is not an issue.

 In any case, you
 should probably be doing performance optimization on a GROMACS version
 that
 isn't a year old.

 I gather that you didn't actually observe the GPUs idle - e.g. with a
 performance monitoring tool? Otherwise, and in the absence of a
 description
 of your simulation system, I'd say that log file looks somewhere between
 normal and optimal. For the record, for better performance, you should
 probably be following the advice of the install guide and not compiling
 FFTW with AVX support, and using one of the five gcc minor versions
 released since 4.4 ;-)

 And besides avoiding ancient gcc versions, I suggest using CUDA 5.5
 (which you can use because you have a version 5.5 driver, which I see in
 your log file).

 Additionally, I suggest avoiding MKL and using FFTW instead. For the
 grid sizes of interest to us, all benchmarks I did in the past showed
 considerably higher FFTW performance. The same goes for icc, but feel free
 to benchmark and please report back if you find the opposite.

 As for the 512 cores log file, the total wall time is approximately the
 sum of PME mesh and PME wait for PP. We think this is because the
 PME-dedicated nodes finished early, and the total wall time is the time
 spent on the PP nodes, so the time spent on PME is hidden by it.


 Yes, using an offload model makes it awkward to report CPU timings,
 because
 there are two kinds of CPU ranks. The total of the Wall t column adds
 up
 to twice the total time taken (which is noted explicitly in more recent
 mdrun versions). By design, the PME ranks do finish early, as you know
 from
 Figure 3.16 of the manual. As you can see in the table, the PP ranks
 spend
 26% of their time waiting for the results from the PME ranks, and this is
 the origin of the note (above the table) that you might want to balance
 things better.

 Mark

 On 8/23/2014 9:30 PM, Mark Abraham wrote:

 On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si sjyz...@gmail.com wrote:

   Hi,

 When we used 2 GPU nodes (each has 2 cpus and 2 gpus) to do a
 mdrun(with
 no PME-dedicated node), we noticed that when CPU are doing PME, GPU
 are
 idle,

 That could happen if the GPU completes its work too fast, in which case
 the
 end of the log file will probably scream about imbalance.

 that is they are doing their work sequentially.


 Highly unlikely, not least because the code is written to overlap the
 short-range work on the GPU with everything else on the CPU. What's
 your
 evidence for *sequential* rather than *imbalanced*?


   Is it supposed to be so?
 No, but without seeing your .log files, mdrun command lines and knowing
 about your hardware, there's nothing we can say.


   Is it the same reason as GPUs on PME-dedicated nodes won't be used
 during

 a run like you said before?

 Why would you 

Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-25 Thread Szilárd Páll
On Mon, Aug 25, 2014 at 8:08 AM, Mark Abraham mark.j.abra...@gmail.com wrote:
 On Mon, Aug 25, 2014 at 5:01 AM, Theodore Si sjyz...@gmail.com wrote:

 Hi,

 https://onedrive.live.com/redir?resid=990FCE59E48164A4!
 2572authkey=!AP82sTNxS6MHgUkithint=file%2clog
 https://onedrive.live.com/redir?resid=990FCE59E48164A4!
 2482authkey=!APLkizOBzXtPHxsithint=file%2clog

 These are 2 log files. The first one  is using 64 cpu cores(64 / 16 = 4
 nodes) and 4nodes*2 = 8 GPUs, and the second is using 512 cpu cores, no GPU.
 When we look at the 64 cores log file, we find that in the  R E A L   C Y
 C L E   A N D   T I M E   A C C O U N T I N G table, the total wall time is
 the sum of every line, that is 37.730=2.201+0.082+...+1.150. So we think
 that when the CPUs is doing PME, GPUs are doing nothing. That's why we say
 they are working sequentially.


 Please note that sequential means one phase after another. Your log
 files don't show the timing breakdown for the GPUs, which is distinct from
 showing that the GPUs ran and then the CPUs ran (which I don't think the
 code even permits!). References to CUDA 8x8 kernels do show the GPU was
 active. There was an issue with mdrun not always being able to gather and
 publish the GPU timing results; I don't recall the conditions (Szilard
 might remember), but it might be fixed in a later release.

It is a limitation (well, I'd say borderline bug) in CUDA that if you
have multiple work-queues (= streams), reliable timing using the CUDA
built-in mechanisms is impossible. There may be a way to work around
this, but that won't happen in the current versions. What's important
is to observe the wait time on the CPU side, and of course, if the OP is
profiling, this is not an issue.
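
(If you want to check GPU activity directly, watching nvidia-smi on a
compute node while the run is active, e.g. nvidia-smi -l 1, is usually
enough to see whether the cards sit idle; that is a generic suggestion,
not something read out of the logs above.)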

 In any case, you
 should probably be doing performance optimization on a GROMACS version that
 isn't a year old.

 I gather that you didn't actually observe the GPUs idle - e.g. with a
 performance monitoring tool? Otherwise, and in the absence of a description
 of your simulation system, I'd say that log file looks somewhere between
 normal and optimal. For the record, for better performance, you should
 probably be following the advice of the install guide and not compiling
 FFTW with AVX support, and using one of the five gcc minor versions
 released since 4.4 ;-)

And besides avoiding ancient gcc versions, I suggest using CUDA 5.5
(which you can use because you have a version 5.5 driver, which I see in
your log file).

Additionally, I suggest avoiding MKL and using FFTW instead. For the
grid sizes of interest to us, all benchmarks I did in the past showed
considerably higher FFTW performance. The same goes for icc, but feel free
to benchmark and please report back if you find the opposite.
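
If you prefer to build FFTW yourself rather than via GMX_BUILD_OWN_FFTW,
the configure step would be roughly the following; the version and the
install prefix are placeholders:

  ./configure --enable-single --enable-sse2 --prefix=$HOME/soft/fftw-3.3.4
  make -j 8 && make install

and then point the GROMACS build at it with
-DGMX_FFT_LIBRARY=fftw3 -DCMAKE_PREFIX_PATH=$HOME/soft/fftw-3.3.4
(note: no --enable-avx, per the install guide's advice).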

 As for the 512 cores log file, the total wall time is approximately the sum
 of PME mesh and PME wait for PP. We think this is because PME-dedicated
 nodes finished early, and the total wall time is the time spent on PP
 nodes, therefore time spent on PME is covered.


 Yes, using an offload model makes it awkward to report CPU timings, because
 there are two kinds of CPU ranks. The total of the Wall t column adds up
 to twice the total time taken (which is noted explicitly in more recent
 mdrun versions). By design, the PME ranks do finish early, as you know from
 Figure 3.16 of the manual. As you can see in the table, the PP ranks spend
 26% of their time waiting for the results from the PME ranks, and this is
 the origin of the note (above the table) that you might want to balance
 things better.

 Mark

 On 8/23/2014 9:30 PM, Mark Abraham wrote:

 On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si sjyz...@gmail.com wrote:

  Hi,

 When we used 2 GPU nodes (each has 2 cpus and 2 gpus) to do a mdrun(with
 no PME-dedicated node), we noticed that when CPU are doing PME, GPU are
 idle,


 That could happen if the GPU completes its work too fast, in which case
 the
 end of the log file will probably scream about imbalance.

 that is they are doing their work sequentially.


 Highly unlikely, not least because the code is written to overlap the
 short-range work on the GPU with everything else on the CPU. What's your
 evidence for *sequential* rather than *imbalanced*?


  Is it supposed to be so?


 No, but without seeing your .log files, mdrun command lines and knowing
 about your hardware, there's nothing we can say.


  Is it the same reason as GPUs on PME-dedicated nodes won't be used during
 a run like you said before?


 Why would you suppose that? I said GPUs do work from the PP ranks on their
 node. That's true here.

 So if we want to exploit our hardware, we should map PP-PME ranks
 manually,

 right? Say, use one node as PME-dedicated node and leave the GPUs on that
 node idle, and use two nodes to do the other stuff. How do you think
 about
 this arrangement?

  Probably a terrible idea. You should identify the cause of the
 imbalance,
 and fix that.

 Mark


  Theo


 On 8/22/2014 7:20 PM, Mark Abraham wrote:

  Hi,

 

Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-25 Thread Xingcheng Lin
Theodore Si sjyzhxw@... writes:

 
 Hi,
 
 https://onedrive.live.com/redir?
resid=990FCE59E48164A4!2572authkey=!AP82sTNxS6MHgUkithint=file%2clog
 https://onedrive.live.com/redir?
resid=990FCE59E48164A4!2482authkey=!APLkizOBzXtPHxsithint=file%2clog
 
 These are 2 log files. The first one  is using 64 cpu cores(64 / 16 = 4 
 nodes) and 4nodes*2 = 8 GPUs, and the second is using 512 cpu cores, no GPU.
 When we look at the 64 cores log file, we find that in the  R E A L   C 
 Y C L E   A N D   T I M E   A C C O U N T I N G table, the total wall 
 time is the sum of every line, that is 37.730=2.201+0.082+...+1.150. So 
 we think that when the CPUs is doing PME, GPUs are doing nothing. That's 
 why we say they are working sequentially.
 As for the 512 cores log file, the total wall time is approximately the 
 sum of PME mesh and PME wait for PP. We think this is because 
 PME-dedicated nodes finished early, and the total wall time is the time 
 spent on PP nodes, therefore time spent on PME is covered.
 
Hi, 

I have a naive question:

In your log file there are only 2 GPUs being detected:

2 GPUs detected on host gpu42:
  #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
  #1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible

In the end you selected 8 GPUs

8 GPUs user-selected for this run: #0, #0, #0, #0, #1, #1, #1, #1

Did you choose 8 GPUs or 2 GPUs? What is your mdrun command?

Thank you,





Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-25 Thread Szilárd Páll
On Mon, Aug 25, 2014 at 7:12 PM, Xingcheng Lin
linxingcheng50...@gmail.com wrote:
 Theodore Si sjyzhxw@... writes:


 Hi,

 https://onedrive.live.com/redir?
 resid=990FCE59E48164A4!2572authkey=!AP82sTNxS6MHgUkithint=file%2clog
 https://onedrive.live.com/redir?
 resid=990FCE59E48164A4!2482authkey=!APLkizOBzXtPHxsithint=file%2clog

 These are 2 log files. The first one  is using 64 cpu cores(64 / 16 = 4
 nodes) and 4nodes*2 = 8 GPUs, and the second is using 512 cpu cores, no GPU.
 When we look at the 64 cores log file, we find that in the  R E A L   C
 Y C L E   A N D   T I M E   A C C O U N T I N G table, the total wall
 time is the sum of every line, that is 37.730=2.201+0.082+...+1.150. So
 we think that when the CPUs is doing PME, GPUs are doing nothing. That's
 why we say they are working sequentially.
 As for the 512 cores log file, the total wall time is approximately the
 sum of PME mesh and PME wait for PP. We think this is because
 PME-dedicated nodes finished early, and the total wall time is the time
 spent on PP nodes, therefore time spent on PME is covered.

 Hi,

 I have a naive question:

 In your log file there are only 2 GPUs being detected:

 2 GPUs detected on host gpu42:
   #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
   #1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible

 In the end you selected 8 GPUs

 8 GPUs user-selected for this run: #0, #0, #0, #0, #1, #1, #1, #1

 Did you choose 8 GPUs or 2 GPUs? What is your mdrun command?

That's an outdated message which is only correct if a single rank/GPU
is used. If you use a more up-to-date 4.6.x or 5.0.x version, you'd
get something like this instead:

2 GPUs user-selected for this run.
Mapping of GPUs to the 8 PP ranks in this node: #0, #0,  #0, #0, #1, #1, #1, #1

Cheers,
--
Szilárd

 Thank you,





Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-25 Thread Theodore Si

I mapped 2 GPUs to multiple MPI ranks by using -gpu_id
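
For what it's worth, a -gpu_id mapping like the one in the log corresponds
to a command of this shape; this is an illustration with made-up rank
counts, not the exact command line used:

  mpirun -np 8 mdrun_mpi -gpu_id 00001111 -s topol.tpr

i.e. eight PP ranks on a node, with the -gpu_id string assigning GPU #0 to
the first four of them and GPU #1 to the last four.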

On 8/26/2014 1:12 AM, Xingcheng Lin wrote:

Theodore Si sjyzhxw@... writes:


Hi,

https://onedrive.live.com/redir?

resid=990FCE59E48164A4!2572authkey=!AP82sTNxS6MHgUkithint=file%2clog

https://onedrive.live.com/redir?

resid=990FCE59E48164A4!2482authkey=!APLkizOBzXtPHxsithint=file%2clog

These are 2 log files. The first one  is using 64 cpu cores(64 / 16 = 4
nodes) and 4nodes*2 = 8 GPUs, and the second is using 512 cpu cores, no GPU.
When we look at the 64 cores log file, we find that in the  R E A L   C
Y C L E   A N D   T I M E   A C C O U N T I N G table, the total wall
time is the sum of every line, that is 37.730=2.201+0.082+...+1.150. So
we think that when the CPUs is doing PME, GPUs are doing nothing. That's
why we say they are working sequentially.
As for the 512 cores log file, the total wall time is approximately the
sum of PME mesh and PME wait for PP. We think this is because
PME-dedicated nodes finished early, and the total wall time is the time
spent on PP nodes, therefore time spent on PME is covered.


Hi,

I have a naive question:

In your log file there are only 2 GPUs being detected:

2 GPUs detected on host gpu42:
   #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
   #1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible

In the end you selected 8 GPUs

8 GPUs user-selected for this run: #0, #0, #0, #0, #1, #1, #1, #1

Did you choose 8 GPUs or 2 GPUs? What is your mdrun command?

Thank you,







Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-25 Thread Theodore Si

Hi  Szilárd,

But CUDA 5.5 won't work with icc 14, right?
It only works with icc 12.1 unless a CUDA 5.5 header is modified.

Theo

On 8/25/2014 9:44 PM, Szilárd Páll wrote:

On Mon, Aug 25, 2014 at 8:08 AM, Mark Abraham mark.j.abra...@gmail.com wrote:

On Mon, Aug 25, 2014 at 5:01 AM, Theodore Si sjyz...@gmail.com wrote:


Hi,

https://onedrive.live.com/redir?resid=990FCE59E48164A4!
2572authkey=!AP82sTNxS6MHgUkithint=file%2clog
https://onedrive.live.com/redir?resid=990FCE59E48164A4!
2482authkey=!APLkizOBzXtPHxsithint=file%2clog

These are 2 log files. The first one  is using 64 cpu cores(64 / 16 = 4
nodes) and 4nodes*2 = 8 GPUs, and the second is using 512 cpu cores, no GPU.
When we look at the 64 cores log file, we find that in the  R E A L   C Y
C L E   A N D   T I M E   A C C O U N T I N G table, the total wall time is
the sum of every line, that is 37.730=2.201+0.082+...+1.150. So we think
that when the CPUs is doing PME, GPUs are doing nothing. That's why we say
they are working sequentially.


Please note that sequential means one phase after another. Your log
files don't show the timing breakdown for the GPUs, which is distinct from
showing that the GPUs ran and then the CPUs ran (which I don't think the
code even permits!). References to CUDA 8x8 kernels do show the GPU was
active. There was an issue with mdrun not always being able to gather and
publish the GPU timing results; I don't recall the conditions (Szilard
might remember), but it might be fixed in a later release.

It is a limitation (well, I'd say borderline bug) in CUDA that if you
have multiple work-queues (=streams), reliable timing using the CUDA
built-in mechanisms is impossible. There may be a way to work around
this, but that won't happen in the current versions. What's important
is to observe the wait time on the CPU sideand of course, if the Op is
profiling this is not an issue.


In any case, you
should probably be doing performance optimization on a GROMACS version that
isn't a year old.

I gather that you didn't actually observe the GPUs idle - e.g. with a
performance monitoring tool? Otherwise, and in the absence of a description
of your simulation system, I'd say that log file looks somewhere between
normal and optimal. For the record, for better performance, you should
probably be following the advice of the install guide and not compiling
FFTW with AVX support, and using one of the five gcc minor versions
released since 4.4 ;-)

And besides avoiding ancient gcc versions, I suggest using CUDA 5.5
(which you can use because you have version 5.5 driver which I see in
your log file):

Additionally, I suggest avoiding MKL and using FFTW instead. For the
grid sizes of our interest all benchmarks I did in the past showed
considerably higher FFTW performance. Same goes for icc, but feel free
to benchmark and please report back if you find the opposite.


As for the 512 cores log file, the total wall time is approximately the sum

of PME mesh and PME wait for PP. We think this is because PME-dedicated
nodes finished early, and the total wall time is the time spent on PP
nodes, therefore time spent on PME is covered.


Yes, using an offload model makes it awkward to report CPU timings, because
there are two kinds of CPU ranks. The total of the Wall t column adds up
to twice the total time taken (which is noted explicitly in more recent
mdrun versions). By design, the PME ranks do finish early, as you know from
Figure 3.16 of the manual. As you can see in the table, the PP ranks spend
26% of their time waiting for the results from the PME ranks, and this is
the origin of the note (above the table) that you might want to balance
things better.

Mark

On 8/23/2014 9:30 PM, Mark Abraham wrote:

On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si sjyz...@gmail.com wrote:

  Hi,

When we used 2 GPU nodes (each has 2 cpus and 2 gpus) to do a mdrun(with
no PME-dedicated node), we noticed that when CPU are doing PME, GPU are
idle,


That could happen if the GPU completes its work too fast, in which case
the
end of the log file will probably scream about imbalance.

that is they are doing their work sequentially.


Highly unlikely, not least because the code is written to overlap the
short-range work on the GPU with everything else on the CPU. What's your
evidence for *sequential* rather than *imbalanced*?


  Is it supposed to be so?
No, but without seeing your .log files, mdrun command lines and knowing
about your hardware, there's nothing we can say.


  Is it the same reason as GPUs on PME-dedicated nodes won't be used during

a run like you said before?


Why would you suppose that? I said GPUs do work from the PP ranks on their
node. That's true here.

So if we want to exploit our hardware, we should map PP-PME ranks
manually,


right? Say, use one node as PME-dedicated node and leave the GPUs on that
node idle, and use two nodes to do the other stuff. How do you think
about
this arrangement?

  Probably a terrible idea. You should 

Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-24 Thread Theodore Si

Hi,

https://onedrive.live.com/redir?resid=990FCE59E48164A4!2572authkey=!AP82sTNxS6MHgUkithint=file%2clog
https://onedrive.live.com/redir?resid=990FCE59E48164A4!2482authkey=!APLkizOBzXtPHxsithint=file%2clog

These are 2 log files. The first one uses 64 CPU cores (64 / 16 = 4
nodes) and 4 nodes * 2 = 8 GPUs, and the second uses 512 CPU cores and no GPUs.
When we look at the 64-core log file, we find that in the
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G  table, the total
wall time is the sum of every line, that is 37.730 = 2.201 + 0.082 + ... + 1.150.
So we think that when the CPUs are doing PME, the GPUs are doing nothing.
That's why we say they are working sequentially.
As for the 512 cores log file, the total wall time is approximately the
sum of PME mesh and PME wait for PP. We think this is because the
PME-dedicated nodes finished early, and the total wall time is the time
spent on the PP nodes, so the time spent on PME is hidden by it.



On 8/23/2014 9:30 PM, Mark Abraham wrote:

On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si sjyz...@gmail.com wrote:


Hi,

When we used 2 GPU nodes (each has 2 cpus and 2 gpus) to do a mdrun(with
no PME-dedicated node), we noticed that when CPU are doing PME, GPU are
idle,


That could happen if the GPU completes its work too fast, in which case the
end of the log file will probably scream about imbalance.

that is they are doing their work sequentially.


Highly unlikely, not least because the code is written to overlap the
short-range work on the GPU with everything else on the CPU. What's your
evidence for *sequential* rather than *imbalanced*?



Is it supposed to be so?


No, but without seeing your .log files, mdrun command lines and knowing
about your hardware, there's nothing we can say.



Is it the same reason as GPUs on PME-dedicated nodes won't be used during
a run like you said before?


Why would you suppose that? I said GPUs do work from the PP ranks on their
node. That's true here.

So if we want to exploit our hardware, we should map PP-PME ranks manually,

right? Say, use one node as PME-dedicated node and leave the GPUs on that
node idle, and use two nodes to do the other stuff. How do you think about
this arrangement?


Probably a terrible idea. You should identify the cause of the imbalance,
and fix that.

Mark



Theo


On 8/22/2014 7:20 PM, Mark Abraham wrote:


Hi,

Because no work will be sent to them. The GPU implementation can
accelerate
domains from PP ranks on their node, but with an MPMD setup that uses
dedicated PME nodes, there will be no PP ranks on nodes that have been set
up with only PME ranks. The two offload models (PP work - GPU; PME work
-
CPU subset) do not work well together, as I said.

One can devise various schemes in 4.6/5.0 that could use those GPUs, but
they either require
* each node does both PME and PP work (thus limiting scaling because of
the
all-to-all for PME, and perhaps making poor use of locality on
multi-socket
nodes), or
* that all nodes have PP ranks, but only some have PME ranks, and the
nodes
map their GPUs to PP ranks in a way that is different depending on whether
PME ranks are present (which could work well, but relies on the DD
load-balancer recognizing and taking advantage of the faster progress of
the PP ranks that have better GPU support, and requires that you get very
dirty hands laying out PP and PME ranks onto hardware that will later
match
the requirements of the DD load balancer, and probably that you balance
PP-PME load manually)

I do not recommend the last approach, because of its complexity.

Clearly there are design decisions to improve. Work is underway.

Cheers,

Mark


On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi Mark,

Could you tell me why that when we are GPU-CPU nodes as PME-dedicated
nodes, the GPU on such nodes will be idle?


Theo

On 8/11/2014 9:36 PM, Mark Abraham wrote:

  Hi,

What Carsten said, if running on nodes that have GPUs.

If running on a mixed setup (some nodes with GPU, some not), then
arranging
your MPI environment to place PME ranks on CPU-only nodes is probably
worthwhile. For example, all your PP ranks first, mapped to GPU nodes,
then
all your PME ranks, mapped to CPU-only nodes, and then use mdrun
-ddorder
pp_pme.

Mark


On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si sjyz...@gmail.com wrote:

   Hi Mark,


This is information of our cluster, could you give us some advice as
regards to our cluster so that we can make GMX run faster on our
system?

Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M


Device Name Device Type Specifications  Number
CPU NodeIntelH2216JFFKRNodesCPU: 2×Intel Xeon E5-2670(8
Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 332
Fat NodeIntelH2216WPFKRNodesCPU: 2×Intel Xeon E5-2670(8
Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 256G(16×16G) ECC Registered DDR3 1600MHz Samsung Memory20
GPU Node

Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-23 Thread Theodore Si

Hi,

When we used 2 GPU nodes (each with 2 CPUs and 2 GPUs) to do an mdrun
(with no PME-dedicated node), we noticed that when the CPUs are doing PME,
the GPUs are idle, that is, they are doing their work sequentially. Is it
supposed to be so? Is it for the same reason that GPUs on PME-dedicated
nodes won't be used during a run, as you said before? So if we want to
exploit our hardware, we should map PP-PME ranks manually, right? Say, use
one node as a PME-dedicated node and leave the GPUs on that node idle, and
use two nodes to do the other stuff. What do you think about this
arrangement?


Theo

On 8/22/2014 7:20 PM, Mark Abraham wrote:

Hi,

Because no work will be sent to them. The GPU implementation can accelerate
domains from PP ranks on their node, but with an MPMD setup that uses
dedicated PME nodes, there will be no PP ranks on nodes that have been set
up with only PME ranks. The two offload models (PP work - GPU; PME work -
CPU subset) do not work well together, as I said.

One can devise various schemes in 4.6/5.0 that could use those GPUs, but
they either require
* each node does both PME and PP work (thus limiting scaling because of the
all-to-all for PME, and perhaps making poor use of locality on multi-socket
nodes), or
* that all nodes have PP ranks, but only some have PME ranks, and the nodes
map their GPUs to PP ranks in a way that is different depending on whether
PME ranks are present (which could work well, but relies on the DD
load-balancer recognizing and taking advantage of the faster progress of
the PP ranks that have better GPU support, and requires that you get very
dirty hands laying out PP and PME ranks onto hardware that will later match
the requirements of the DD load balancer, and probably that you balance
PP-PME load manually)

I do not recommend the last approach, because of its complexity.

Clearly there are design decisions to improve. Work is underway.

Cheers,

Mark


On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si sjyz...@gmail.com wrote:


Hi Mark,

Could you tell me why that when we are GPU-CPU nodes as PME-dedicated
nodes, the GPU on such nodes will be idle?


Theo

On 8/11/2014 9:36 PM, Mark Abraham wrote:


Hi,

What Carsten said, if running on nodes that have GPUs.

If running on a mixed setup (some nodes with GPU, some not), then
arranging
your MPI environment to place PME ranks on CPU-only nodes is probably
worthwhile. For example, all your PP ranks first, mapped to GPU nodes,
then
all your PME ranks, mapped to CPU-only nodes, and then use mdrun -ddorder
pp_pme.

Mark


On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi Mark,

This is information of our cluster, could you give us some advice as
regards to our cluster so that we can make GMX run faster on our system?

Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M


Device Name Device Type Specifications  Number
CPU NodeIntelH2216JFFKRNodesCPU: 2×Intel Xeon E5-2670(8
Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 332
Fat NodeIntelH2216WPFKRNodesCPU: 2×Intel Xeon E5-2670(8
Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 256G(16×16G) ECC Registered DDR3 1600MHz Samsung Memory20
GPU NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8
Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 50
MIC NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8
Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 5
Computing Network SwitchMellanox Infiniband FDR Core Switch
648× FDR Core Switch MSX6536-10R, Mellanox Unified Fabric Manager   1
Mellanox SX1036 40Gb Switch 36× 40Gb Ethernet Switch SX1036, 36× QSFP
Interface 1
Management Network Switch   Extreme Summit X440-48t-10G 2-layer
Switch
48× 1Giga Switch Summit X440-48t-10G, authorized by ExtremeXOS   9
Extreme Summit X650-24X 3-layer Switch  24× 10Giga 3-layer Ethernet
Switch
Summit X650-24X, authorized by ExtremeXOS1
Parallel StorageDDN Parallel Storage System DDN SFA12K
Storage
System   1
GPU GPU Accelerator NVIDIA Tesla Kepler K20M70
MIC MIC Intel Xeon Phi 5110P Knights Corner 10
40Gb Ethernet Card  MCX314A-BCBTMellanox ConnextX-3 Chip 40Gb
Ethernet Card
2× 40Gb Ethernet ports, enough QSFP cables  16
SSD Intel SSD910Intel SSD910 Disk, 400GB, PCIE  80







On 8/10/2014 5:50 AM, Mark Abraham wrote:

  That's not what I said You can set...

-npme behaves the same whether or not GPUs are in use. Using separate
ranks
for PME caters to trying to minimize the cost of the all-to-all
communication of the 3DFFT. That's still relevant when using GPUs, but
if
separate PME ranks are used, any GPUs on nodes that only have PME ranks
are
left idle. The most effective approach depends critically on the
hardware
and simulation setup, and whether you pay money for 

Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-23 Thread Mark Abraham
On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si sjyz...@gmail.com wrote:

 Hi,

 When we used 2 GPU nodes (each has 2 cpus and 2 gpus) to do a mdrun(with
 no PME-dedicated node), we noticed that when CPU are doing PME, GPU are
 idle,


That could happen if the GPU completes its work too fast, in which case the
end of the log file will probably scream about imbalance.

that is they are doing their work sequentially.


Highly unlikely, not least because the code is written to overlap the
short-range work on the GPU with everything else on the CPU. What's your
evidence for *sequential* rather than *imbalanced*?


 Is it supposed to be so?


No, but without seeing your .log files, mdrun command lines and knowing
about your hardware, there's nothing we can say.


 Is it the same reason as GPUs on PME-dedicated nodes won't be used during
 a run like you said before?


Why would you suppose that? I said GPUs do work from the PP ranks on their
node. That's true here.

So if we want to exploit our hardware, we should map PP-PME ranks manually,
 right? Say, use one node as PME-dedicated node and leave the GPUs on that
 node idle, and use two nodes to do the other stuff. How do you think about
 this arrangement?


Probably a terrible idea. You should identify the cause of the imbalance,
and fix that.

Mark



 Theo


 On 8/22/2014 7:20 PM, Mark Abraham wrote:

 Hi,

 Because no work will be sent to them. The GPU implementation can
 accelerate
 domains from PP ranks on their node, but with an MPMD setup that uses
 dedicated PME nodes, there will be no PP ranks on nodes that have been set
 up with only PME ranks. The two offload models (PP work - GPU; PME work
 -
 CPU subset) do not work well together, as I said.

 One can devise various schemes in 4.6/5.0 that could use those GPUs, but
 they either require
 * each node does both PME and PP work (thus limiting scaling because of
 the
 all-to-all for PME, and perhaps making poor use of locality on
 multi-socket
 nodes), or
 * that all nodes have PP ranks, but only some have PME ranks, and the
 nodes
 map their GPUs to PP ranks in a way that is different depending on whether
 PME ranks are present (which could work well, but relies on the DD
 load-balancer recognizing and taking advantage of the faster progress of
 the PP ranks that have better GPU support, and requires that you get very
 dirty hands laying out PP and PME ranks onto hardware that will later
 match
 the requirements of the DD load balancer, and probably that you balance
 PP-PME load manually)

 I do not recommend the last approach, because of its complexity.

 Clearly there are design decisions to improve. Work is underway.

 Cheers,

 Mark


 On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi Mark,

 Could you tell me why that when we are GPU-CPU nodes as PME-dedicated
 nodes, the GPU on such nodes will be idle?


 Theo

 On 8/11/2014 9:36 PM, Mark Abraham wrote:

  Hi,

 What Carsten said, if running on nodes that have GPUs.

 If running on a mixed setup (some nodes with GPU, some not), then
 arranging
 your MPI environment to place PME ranks on CPU-only nodes is probably
 worthwhile. For example, all your PP ranks first, mapped to GPU nodes,
 then
 all your PME ranks, mapped to CPU-only nodes, and then use mdrun
 -ddorder
 pp_pme.

 Mark


 On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si sjyz...@gmail.com wrote:

   Hi Mark,

 This is information of our cluster, could you give us some advice as
 regards to our cluster so that we can make GMX run faster on our
 system?

 Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M


 Device Name Device Type Specifications  Number
 CPU NodeIntelH2216JFFKRNodesCPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 332
 Fat NodeIntelH2216WPFKRNodesCPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 256G(16×16G) ECC Registered DDR3 1600MHz Samsung Memory20
 GPU NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 50
 MIC NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 5
 Computing Network SwitchMellanox Infiniband FDR Core Switch
 648× FDR Core Switch MSX6536-10R, Mellanox Unified Fabric Manager
  1
 Mellanox SX1036 40Gb Switch 36× 40Gb Ethernet Switch SX1036, 36×
 QSFP
 Interface 1
 Management Network Switch   Extreme Summit X440-48t-10G 2-layer
 Switch
 48× 1Giga Switch Summit X440-48t-10G, authorized by ExtremeXOS   9
 Extreme Summit X650-24X 3-layer Switch  24× 10Giga 3-layer Ethernet
 Switch
 Summit X650-24X, authorized by ExtremeXOS1
 Parallel StorageDDN Parallel Storage System DDN SFA12K
 Storage
 System   1
 GPU GPU 

Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-22 Thread Theodore Si

Hi Mark,

Could you tell me why, when we use GPU-CPU nodes as PME-dedicated
nodes, the GPUs on such nodes will be idle?


Theo

On 8/11/2014 9:36 PM, Mark Abraham wrote:

Hi,

What Carsten said, if running on nodes that have GPUs.

If running on a mixed setup (some nodes with GPU, some not), then arranging
your MPI environment to place PME ranks on CPU-only nodes is probably
worthwhile. For example, all your PP ranks first, mapped to GPU nodes, then
all your PME ranks, mapped to CPU-only nodes, and then use mdrun -ddorder
pp_pme.

Mark


On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si sjyz...@gmail.com wrote:


Hi Mark,

This is information of our cluster, could you give us some advice as
regards to our cluster so that we can make GMX run faster on our system?

Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M


Device Name Device Type Specifications  Number
CPU NodeIntelH2216JFFKRNodesCPU: 2×Intel Xeon E5-2670(8 Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 332
Fat NodeIntelH2216WPFKRNodesCPU: 2×Intel Xeon E5-2670(8 Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 256G(16×16G) ECC Registered DDR3 1600MHz Samsung Memory20
GPU NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8 Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 50
MIC NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8 Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 5
Computing Network SwitchMellanox Infiniband FDR Core Switch
648× FDR Core Switch MSX6536-10R, Mellanox Unified Fabric Manager   1
Mellanox SX1036 40Gb Switch 36× 40Gb Ethernet Switch SX1036, 36× QSFP
Interface 1
Management Network Switch   Extreme Summit X440-48t-10G 2-layer Switch
48× 1Giga Switch Summit X440-48t-10G, authorized by ExtremeXOS   9
Extreme Summit X650-24X 3-layer Switch  24× 10Giga 3-layer Ethernet Switch
Summit X650-24X, authorized by ExtremeXOS1
Parallel StorageDDN Parallel Storage System DDN SFA12K Storage
System   1
GPU GPU Accelerator NVIDIA Tesla Kepler K20M70
MIC MIC Intel Xeon Phi 5110P Knights Corner 10
40Gb Ethernet Card  MCX314A-BCBTMellanox ConnextX-3 Chip 40Gb
Ethernet Card
2× 40Gb Ethernet ports, enough QSFP cables  16
SSD Intel SSD910Intel SSD910 Disk, 400GB, PCIE  80







On 8/10/2014 5:50 AM, Mark Abraham wrote:


That's not what I said You can set...

-npme behaves the same whether or not GPUs are in use. Using separate
ranks
for PME caters to trying to minimize the cost of the all-to-all
communication of the 3DFFT. That's still relevant when using GPUs, but if
separate PME ranks are used, any GPUs on nodes that only have PME ranks
are
left idle. The most effective approach depends critically on the hardware
and simulation setup, and whether you pay money for your hardware.

Mark


On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi,

You mean no matter we use GPU acceleration or not, -npme is just a
reference?
Why we can't set that to a exact value?


On 8/9/2014 5:14 AM, Mark Abraham wrote:

  You can set the number of PME-only ranks with -npme. Whether it's useful

is
another matter :-) The CPU-based PME offload and the GPU-based PP
offload
do not combine very well.

Mark


On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:

   Hi,


Can we set the number manually with -npme when using GPU acceleration?



Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-22 Thread Mark Abraham
Hi,

Because no work will be sent to them. The GPU implementation can accelerate
domains from PP ranks on their node, but with an MPMD setup that uses
dedicated PME nodes, there will be no PP ranks on nodes that have been set
up with only PME ranks. The two offload models (PP work -> GPU; PME work ->
CPU subset) do not work well together, as I said.

One can devise various schemes in 4.6/5.0 that could use those GPUs, but
they either require
* each node does both PME and PP work (thus limiting scaling because of the
all-to-all for PME, and perhaps making poor use of locality on multi-socket
nodes), or
* that all nodes have PP ranks, but only some have PME ranks, and the nodes
map their GPUs to PP ranks in a way that is different depending on whether
PME ranks are present (which could work well, but relies on the DD
load-balancer recognizing and taking advantage of the faster progress of
the PP ranks that have better GPU support, and requires that you get very
dirty hands laying out PP and PME ranks onto hardware that will later match
the requirements of the DD load balancer, and probably that you balance
PP-PME load manually)

I do not recommend the last approach, because of its complexity.
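
The first scheme, by contrast, needs no special rank layout; a minimal
sketch (rank counts and file names are placeholders) is simply to forbid
separate PME ranks, so every node keeps PP ranks, and therefore its GPUs,
busy:

  mpirun -np 32 mdrun_mpi -npme 0 -gpu_id 01 -s topol.tpr

Here -npme 0 keeps the PME work on the PP ranks, and with two PP ranks per
node (however your MPI launcher places them) -gpu_id 01 gives each of them
its own GPU.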

Clearly there are design decisions to improve. Work is underway.

Cheers,

Mark


On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si sjyz...@gmail.com wrote:

 Hi Mark,

 Could you tell me why that when we are GPU-CPU nodes as PME-dedicated
 nodes, the GPU on such nodes will be idle?


 Theo

 On 8/11/2014 9:36 PM, Mark Abraham wrote:

 Hi,

 What Carsten said, if running on nodes that have GPUs.

 If running on a mixed setup (some nodes with GPU, some not), then
 arranging
 your MPI environment to place PME ranks on CPU-only nodes is probably
 worthwhile. For example, all your PP ranks first, mapped to GPU nodes,
 then
 all your PME ranks, mapped to CPU-only nodes, and then use mdrun -ddorder
 pp_pme.

 Mark


 On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi Mark,

 This is information of our cluster, could you give us some advice as
 regards to our cluster so that we can make GMX run faster on our system?

 Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M


 Device Name Device Type Specifications  Number
 CPU NodeIntelH2216JFFKRNodesCPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 332
 Fat NodeIntelH2216WPFKRNodesCPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 256G(16×16G) ECC Registered DDR3 1600MHz Samsung Memory20
 GPU NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 50
 MIC NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 5
 Computing Network SwitchMellanox Infiniband FDR Core Switch
 648× FDR Core Switch MSX6536-10R, Mellanox Unified Fabric Manager   1
 Mellanox SX1036 40Gb Switch 36× 40Gb Ethernet Switch SX1036, 36× QSFP
 Interface 1
 Management Network Switch   Extreme Summit X440-48t-10G 2-layer
 Switch
 48× 1Giga Switch Summit X440-48t-10G, authorized by ExtremeXOS   9
 Extreme Summit X650-24X 3-layer Switch  24× 10Giga 3-layer Ethernet
 Switch
 Summit X650-24X, authorized by ExtremeXOS1
 Parallel StorageDDN Parallel Storage System DDN SFA12K
 Storage
 System   1
 GPU GPU Accelerator NVIDIA Tesla Kepler K20M70
 MIC MIC Intel Xeon Phi 5110P Knights Corner 10
 40Gb Ethernet Card  MCX314A-BCBTMellanox ConnextX-3 Chip 40Gb
 Ethernet Card
 2× 40Gb Ethernet ports, enough QSFP cables  16
 SSD Intel SSD910Intel SSD910 Disk, 400GB, PCIE  80







 On 8/10/2014 5:50 AM, Mark Abraham wrote:

  That's not what I said You can set...

 -npme behaves the same whether or not GPUs are in use. Using separate
 ranks
 for PME caters to trying to minimize the cost of the all-to-all
 communication of the 3DFFT. That's still relevant when using GPUs, but
 if
 separate PME ranks are used, any GPUs on nodes that only have PME ranks
 are
 left idle. The most effective approach depends critically on the
 hardware
 and simulation setup, and whether you pay money for your hardware.

 Mark


 On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si sjyz...@gmail.com wrote:

   Hi,

 You mean no matter we use GPU acceleration or not, -npme is just a
 reference?
 Why we can't set that to a exact value?


 On 8/9/2014 5:14 AM, Mark Abraham wrote:

   You can set the number of PME-only ranks with -npme. Whether it's
 useful

 is
 another matter :-) The CPU-based PME offload and the GPU-based PP
 offload
 do not combine very well.

 Mark


 On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com
 wrote:

Hi,

  Can we set the number 

Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-19 Thread Theodore Si

Hi,

How can we designate certain CPU-only nodes as PME-dedicated nodes?
What mdrun options or what configuration should we use to make that happen?


BR,
Theo

On 8/11/2014 9:36 PM, Mark Abraham wrote:

Hi,

What Carsten said, if running on nodes that have GPUs.

If running on a mixed setup (some nodes with GPU, some not), then arranging
your MPI environment to place PME ranks on CPU-only nodes is probably
worthwhile. For example, all your PP ranks first, mapped to GPU nodes, then
all your PME ranks, mapped to CPU-only nodes, and then use mdrun -ddorder
pp_pme.

Mark


On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si sjyz...@gmail.com wrote:


Hi Mark,

This is information of our cluster, could you give us some advice as
regards to our cluster so that we can make GMX run faster on our system?

Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M


Device Name Device Type Specifications  Number
CPU NodeIntelH2216JFFKRNodesCPU: 2×Intel Xeon E5-2670(8 Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 332
Fat NodeIntelH2216WPFKRNodesCPU: 2×Intel Xeon E5-2670(8 Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 256G(16×16G) ECC Registered DDR3 1600MHz Samsung Memory20
GPU NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8 Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 50
MIC NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8 Cores,
2.6GHz, 20MB Cache, 8.0GT)
Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 5
Computing Network SwitchMellanox Infiniband FDR Core Switch
648× FDR Core Switch MSX6536-10R, Mellanox Unified Fabric Manager   1
Mellanox SX1036 40Gb Switch 36× 40Gb Ethernet Switch SX1036, 36× QSFP
Interface 1
Management Network Switch   Extreme Summit X440-48t-10G 2-layer Switch
48× 1Giga Switch Summit X440-48t-10G, authorized by ExtremeXOS   9
Extreme Summit X650-24X 3-layer Switch  24× 10Giga 3-layer Ethernet Switch
Summit X650-24X, authorized by ExtremeXOS1
Parallel StorageDDN Parallel Storage System DDN SFA12K Storage
System   1
GPU GPU Accelerator NVIDIA Tesla Kepler K20M70
MIC MIC Intel Xeon Phi 5110P Knights Corner 10
40Gb Ethernet Card  MCX314A-BCBTMellanox ConnextX-3 Chip 40Gb
Ethernet Card
2× 40Gb Ethernet ports, enough QSFP cables  16
SSD Intel SSD910Intel SSD910 Disk, 400GB, PCIE  80







On 8/10/2014 5:50 AM, Mark Abraham wrote:


That's not what I said You can set...

-npme behaves the same whether or not GPUs are in use. Using separate
ranks
for PME caters to trying to minimize the cost of the all-to-all
communication of the 3DFFT. That's still relevant when using GPUs, but if
separate PME ranks are used, any GPUs on nodes that only have PME ranks
are
left idle. The most effective approach depends critically on the hardware
and simulation setup, and whether you pay money for your hardware.

Mark


On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi,

You mean no matter we use GPU acceleration or not, -npme is just a
reference?
Why we can't set that to a exact value?


On 8/9/2014 5:14 AM, Mark Abraham wrote:

  You can set the number of PME-only ranks with -npme. Whether it's useful

is
another matter :-) The CPU-based PME offload and the GPU-based PP
offload
do not combine very well.

Mark


On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:

   Hi,


Can we set the number manually with -npme when using GPU acceleration?



Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-19 Thread Szilárd Páll
On Tue, Aug 19, 2014 at 4:19 PM, Theodore Si sjyz...@gmail.com wrote:
 Hi,

 How can we designate which CPU-only nodes to be PME-dedicated nodes?

mpirun -np N mdrun_mpi -npme M

Starts N ranks, out of which M will be PME-only and (N-M) will be PP ranks.

 What
 mdrun options or what configuration should we use to make that happen?

You can change the rank ordering with -ddorder; there are three
available patterns. Otherwise, you can do manual rank ordering by
telling MPI how to reorder the ranks presented to mdrun itself; e.g. with
MPICH you can use the MPICH_RANK_ORDER environment variable.
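
As a concrete sketch, with numbers that are only an example: on three
16-core nodes, two of them with GPUs and one without,

  mpirun -np 48 mdrun_mpi -npme 16 -ddorder pp_pme -gpu_id 0000000011111111 -s topol.tpr

starts 32 PP ranks followed by 16 PME-only ranks; if your MPI rank
placement (hostfile, rankfile or MPICH_RANK_ORDER) puts ranks 0-31 on the
GPU nodes and ranks 32-47 on the CPU-only node, the dedicated PME work
ends up on the CPU-only node, as Mark suggested earlier, while the -gpu_id
string shares the two GPUs among the 16 PP ranks on each GPU node.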

Cheers,
--
Szilárd

 BR,
 Theo


 On 8/11/2014 9:36 PM, Mark Abraham wrote:

 Hi,

 What Carsten said, if running on nodes that have GPUs.

 If running on a mixed setup (some nodes with GPU, some not), then
 arranging
 your MPI environment to place PME ranks on CPU-only nodes is probably
 worthwhile. For example, all your PP ranks first, mapped to GPU nodes,
 then
 all your PME ranks, mapped to CPU-only nodes, and then use mdrun -ddorder
 pp_pme.

 Mark


 On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si sjyz...@gmail.com wrote:

 Hi Mark,

 This is information of our cluster, could you give us some advice as
 regards to our cluster so that we can make GMX run faster on our system?

 Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M


 Device Name Device Type Specifications  Number
 CPU NodeIntelH2216JFFKRNodesCPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 332
 Fat NodeIntelH2216WPFKRNodesCPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 256G(16×16G) ECC Registered DDR3 1600MHz Samsung Memory20
 GPU NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 50
 MIC NodeIntelR2208GZ4GC CPU: 2×Intel Xeon E5-2670(8
 Cores,
 2.6GHz, 20MB Cache, 8.0GT)
 Mem: 64GB(8×8GB) ECC Registered DDR3 1600MHz Samsung Memory 5
 Computing Network SwitchMellanox Infiniband FDR Core Switch
 648× FDR Core Switch MSX6536-10R, Mellanox Unified Fabric Manager   1
 Mellanox SX1036 40Gb Switch 36× 40Gb Ethernet Switch SX1036, 36× QSFP
 Interface 1
 Management Network Switch   Extreme Summit X440-48t-10G 2-layer
 Switch
 48× 1Giga Switch Summit X440-48t-10G, authorized by ExtremeXOS   9
 Extreme Summit X650-24X 3-layer Switch  24× 10Giga 3-layer Ethernet
 Switch
 Summit X650-24X, authorized by ExtremeXOS1
 Parallel StorageDDN Parallel Storage System DDN SFA12K
 Storage
 System   1
 GPU GPU Accelerator NVIDIA Tesla Kepler K20M70
 MIC MIC Intel Xeon Phi 5110P Knights Corner 10
 40Gb Ethernet Card  MCX314A-BCBTMellanox ConnextX-3 Chip 40Gb
 Ethernet Card
 2× 40Gb Ethernet ports, enough QSFP cables  16
 SSD Intel SSD910Intel SSD910 Disk, 400GB, PCIE  80







 On 8/10/2014 5:50 AM, Mark Abraham wrote:

 That's not what I said You can set...

 -npme behaves the same whether or not GPUs are in use. Using separate
 ranks
 for PME caters to trying to minimize the cost of the all-to-all
 communication of the 3DFFT. That's still relevant when using GPUs, but
 if
 separate PME ranks are used, any GPUs on nodes that only have PME ranks
 are
 left idle. The most effective approach depends critically on the
 hardware
 and simulation setup, and whether you pay money for your hardware.

 Mark


 On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si sjyz...@gmail.com wrote:

   Hi,

 You mean no matter we use GPU acceleration or not, -npme is just a
 reference?
 Why we can't set that to a exact value?


 On 8/9/2014 5:14 AM, Mark Abraham wrote:

   You can set the number of PME-only ranks with -npme. Whether it's
 useful

 is
 another matter :-) The CPU-based PME offload and the GPU-based PP
 offload
 do not combine very well.

 Mark


 On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:

Hi,

 Can we set the number manually with -npme when using GPU
 acceleration?



Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

2014-08-11 Thread Theodore Si

Hi Mark,

This is the information about our cluster; could you give us some advice
regarding our cluster so that we can make GMX run faster on our system?


Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M


Device Name | Device Type | Specifications | Number
CPU Node | IntelH2216JFFKRNodes | CPU: 2×Intel Xeon E5-2670 (8 Cores, 2.6GHz, 20MB Cache, 8.0GT); Mem: 64GB (8×8GB) ECC Registered DDR3 1600MHz Samsung Memory | 332
Fat Node | IntelH2216WPFKRNodes | CPU: 2×Intel Xeon E5-2670 (8 Cores, 2.6GHz, 20MB Cache, 8.0GT); Mem: 256G (16×16G) ECC Registered DDR3 1600MHz Samsung Memory | 20
GPU Node | IntelR2208GZ4GC | CPU: 2×Intel Xeon E5-2670 (8 Cores, 2.6GHz, 20MB Cache, 8.0GT); Mem: 64GB (8×8GB) ECC Registered DDR3 1600MHz Samsung Memory | 50
MIC Node | IntelR2208GZ4GC | CPU: 2×Intel Xeon E5-2670 (8 Cores, 2.6GHz, 20MB Cache, 8.0GT); Mem: 64GB (8×8GB) ECC Registered DDR3 1600MHz Samsung Memory | 5
Computing Network Switch | Mellanox Infiniband FDR Core Switch | 648× FDR Core Switch MSX6536-10R, Mellanox Unified Fabric Manager | 1
Mellanox SX1036 40Gb Switch | 36× 40Gb Ethernet Switch SX1036, 36× QSFP Interface | 1
Management Network Switch | Extreme Summit X440-48t-10G 2-layer Switch | 48× 1Giga Switch Summit X440-48t-10G, authorized by ExtremeXOS | 9
Extreme Summit X650-24X 3-layer Switch | 24× 10Giga 3-layer Ethernet Switch Summit X650-24X, authorized by ExtremeXOS | 1
Parallel Storage | DDN Parallel Storage System | DDN SFA12K Storage System | 1
GPU | GPU Accelerator | NVIDIA Tesla Kepler K20M | 70
MIC | MIC | Intel Xeon Phi 5110P Knights Corner | 10
40Gb Ethernet Card | MCX314A-BCBT | Mellanox ConnextX-3 Chip 40Gb Ethernet Card, 2× 40Gb Ethernet ports, enough QSFP cables | 16
SSD | Intel SSD910 | Intel SSD910 Disk, 400GB, PCIE | 80






On 8/10/2014 5:50 AM, Mark Abraham wrote:

That's not what I said You can set...

-npme behaves the same whether or not GPUs are in use. Using separate ranks
for PME caters to trying to minimize the cost of the all-to-all
communication of the 3DFFT. That's still relevant when using GPUs, but if
separate PME ranks are used, any GPUs on nodes that only have PME ranks are
left idle. The most effective approach depends critically on the hardware
and simulation setup, and whether you pay money for your hardware.

Mark


On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si sjyz...@gmail.com wrote:


Hi,

You mean no matter we use GPU acceleration or not, -npme is just a
reference?
Why we can't set that to a exact value?


On 8/9/2014 5:14 AM, Mark Abraham wrote:


You can set the number of PME-only ranks with -npme. Whether it's useful
is
another matter :-) The CPU-based PME offload and the GPU-based PP offload
do not combine very well.

Mark


On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi,

Can we set the number manually with -npme when using GPU acceleration?





--
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.


Re: [gmx-users] Can we set the number of pure PME nodes when using GPUCPU?

2014-08-11 Thread Carsten Kutzner
Hi,

you could start twice as many MPI processes per node as you have GPUs on a
node and use half of all processes for PME, e.g. on 4 nodes:

mpirun -np 8 mdrun -s in.tpr -npme 4

or start 4 processes per node:
mpirun -np 16 mdrun -s in.tpr -npme 4 -gpu_id 0011

or with more OpenMP threads for the PME processes:
mpirun -np 16 mdrun -ntomp 2 -ntomp_pme 6 -pin on -s in.tpr -npme 4 -gpu_id 0011

With -ntomp and -ntomp_pme you can fine-tune the ratio of compute power
between the PME and PP processes. You need to try out different combinations
to find the optimum; the comments in the md.log file give hints on what to
change, and a small sweep like the one sketched below is one way to do this.

This approach usually yields good performance if you use several nodes;
on a single node there will almost certainly be better settings (most likely
more MPI processes with fewer OpenMP threads each).
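
For example, a sweep over a few -ntomp_pme values could look like this (just a
sketch, assuming the 4-node, 16-cores-and-2-GPUs-per-node setup above; the
bench_*.log names are placeholders, and -nsteps/-resethway/-noconfout only keep
the benchmark runs short and comparable):

for ntomp_pme in 2 4 6 8 10; do
    mpirun -np 16 mdrun -s in.tpr -npme 4 -gpu_id 0011 -pin on \
           -ntomp 2 -ntomp_pme $ntomp_pme \
           -nsteps 5000 -resethway -noconfout -g bench_ntomp_pme_${ntomp_pme}.log
done

Then compare the Performance (ns/day) reported at the end of each bench_*.log.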

Carsten


On 11 Aug 2014, at 11:45, Theodore Si sjyz...@gmail.com wrote:

 Hi Mark,
 
 This is information of our cluster, could you give us some advice as regards 
 to our cluster so that we can make GMX run faster on our system?
 
 Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M
 
 
 
 On 8/10/2014 5:50 AM, Mark Abraham wrote:
 That's not what I said You can set...
 
 -npme behaves the same whether or not GPUs are in use. Using separate ranks
 for PME caters to trying to minimize the cost of the all-to-all
 communication of the 3DFFT. That's still relevant when using GPUs, but if
 separate PME ranks are used, any GPUs on nodes that only have PME ranks are
 left idle. The most effective approach depends critically on the hardware
 and simulation setup, and whether you pay money for your hardware.
 
 Mark
 
 
 On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si sjyz...@gmail.com wrote:
 
 Hi,
 
 You mean no matter we use GPU acceleration or not, -npme is just a
 reference?
 Why we can't set that to a exact value?
 
 
 On 8/9/2014 5:14 AM, Mark Abraham wrote:
 
 You can set the number of PME-only ranks with -npme. Whether it's useful
 is
 another matter :-) The CPU-based PME offload and the GPU-based PP offload
 do not combine very well.
 
 Mark
 
 
 On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:
 
  Hi,
 Can we set the number manually with -npme when using GPU acceleration?
 
 

Re: [gmx-users] Can we set the number of pure PME nodes when using GPUCPU?

2014-08-11 Thread Mark Abraham
Hi,

What Carsten said, if running on nodes that have GPUs.

If running on a mixed setup (some nodes with GPUs, some without), then arranging
your MPI environment to place the PME ranks on CPU-only nodes is probably
worthwhile. For example, place all your PP ranks first, mapped to the GPU nodes,
then all your PME ranks, mapped to the CPU-only nodes, and use mdrun -ddorder
pp_pme.
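
A minimal sketch of one way to arrange that with Open MPI (the hostnames, slot
counts and in.tpr are hypothetical, and the exact rank placement depends on your
MPI library and its mapping defaults):

# hostfile.txt -- GPU nodes listed first, so the low-numbered ranks (PP) land
# there; CPU-only nodes last, so the remaining ranks (PME) land there
gpu-node01 slots=2
gpu-node02 slots=2
cpu-node01 slots=2
cpu-node02 slots=2

# 4 PP ranks on the GPU nodes (one per GPU), 4 PME-only ranks on the CPU-only nodes
mpirun -np 8 --hostfile hostfile.txt mdrun -s in.tpr -npme 4 -ddorder pp_pme -gpu_id 01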

Mark


On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si sjyz...@gmail.com wrote:

 Hi Mark,

 This is information of our cluster, could you give us some advice as
 regards to our cluster so that we can make GMX run faster on our system?

 Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia K20M









 On 8/10/2014 5:50 AM, Mark Abraham wrote:

 That's not what I said You can set...

 -npme behaves the same whether or not GPUs are in use. Using separate
 ranks
 for PME caters to trying to minimize the cost of the all-to-all
 communication of the 3DFFT. That's still relevant when using GPUs, but if
 separate PME ranks are used, any GPUs on nodes that only have PME ranks
 are
 left idle. The most effective approach depends critically on the hardware
 and simulation setup, and whether you pay money for your hardware.

 Mark


 On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi,

 You mean no matter we use GPU acceleration or not, -npme is just a
 reference?
 Why we can't set that to a exact value?


 On 8/9/2014 5:14 AM, Mark Abraham wrote:

  You can set the number of PME-only ranks with -npme. Whether it's useful
 is
 another matter :-) The CPU-based PME offload and the GPU-based PP
 offload
 do not combine very well.

 Mark


 On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:

   Hi,

 Can we set the number manually with -npme when using GPU acceleration?



-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.


Re: [gmx-users] Can we set the number of pure PME nodes when using GPUCPU?

2014-08-09 Thread Theodore Si

Hi,

Do you mean that, whether or not we use GPU acceleration, -npme is only taken
as a reference?

Why can't we set it to an exact value?

On 8/9/2014 5:14 AM, Mark Abraham wrote:

You can set the number of PME-only ranks with -npme. Whether it's useful is
another matter :-) The CPU-based PME offload and the GPU-based PP offload
do not combine very well.

Mark


On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:


Hi,

Can we set the number manually with -npme when using GPU acceleration?





--
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.


Re: [gmx-users] Can we set the number of pure PME nodes when using GPUCPU?

2014-08-09 Thread Mark Abraham
That's not what I said ("You can set...").

-npme behaves the same whether or not GPUs are in use. Using separate ranks
for PME caters to trying to minimize the cost of the all-to-all
communication of the 3DFFT. That's still relevant when using GPUs, but if
separate PME ranks are used, any GPUs on nodes that only have PME ranks are
left idle. The most effective approach depends critically on the hardware
and simulation setup, and whether you pay money for your hardware.

Mark


On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si sjyz...@gmail.com wrote:

 Hi,

 You mean no matter we use GPU acceleration or not, -npme is just a
 reference?
 Why we can't set that to a exact value?


 On 8/9/2014 5:14 AM, Mark Abraham wrote:

 You can set the number of PME-only ranks with -npme. Whether it's useful
 is
 another matter :-) The CPU-based PME offload and the GPU-based PP offload
 do not combine very well.

 Mark


 On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:

  Hi,

 Can we set the number manually with -npme when using GPU acceleration?



-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.


Re: [gmx-users] Can we set the number of pure PME nodes when using GPUCPU?

2014-08-08 Thread Mark Abraham
You can set the number of PME-only ranks with -npme. Whether it's useful is
another matter :-) The CPU-based PME offload and the GPU-based PP offload
do not combine very well.
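
As a minimal illustration (topol.tpr is just a placeholder name), this starts
6 PP ranks and 2 PME-only ranks:

mpirun -np 8 mdrun -s topol.tpr -npme 2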

Mark


On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:

 Hi,

 Can we set the number manually with -npme when using GPU acceleration?



-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.