Re: [gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

2013-06-09 Thread Szilárd Páll
On Wed, Jun 5, 2013 at 4:35 PM, João Henriques
joao.henriques.32...@gmail.com wrote:
 Just to wrap up this thread, it does work when the mpirun is properly
 configured. I knew it had to be my fault :)

 Something like this works like a charm:
 mpirun -npernode 2 mdrun_mpi -ntomp 8 -gpu_id 01 -deffnm md -v

That is indeed the correct way to launch the simulation: this way
you'll have two ranks per node, each using a different GPU. However,
coming back to your initial (non-working) launch config, if you want
to run 4 ranks x 4 OpenMP threads per node, you'll have to assign two
ranks to each GPU:
mpirun -npernode 4 mdrun_mpi -gpu_id 0011 -deffnm md -v

If OpenMP multi-threading scaling is limiting performance, the above
will help compared to 2 ranks x 8 threads per node - which is often
the case on AMD, and I've seen cases where it already helped on a
single Intel node.
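
To make the rank-to-GPU mapping explicit, here is a minimal sketch of
both per-node layouts on the 16-core, dual-K20 nodes discussed in this
thread (assuming Open MPI's -npernode; -ntomp and -gpu_id are mdrun's
own options):

# 2 ranks per node, 8 OpenMP threads each, one GPU per rank:
mpirun -npernode 2 mdrun_mpi -ntomp 8 -gpu_id 01 -deffnm md -v

# 4 ranks per node, 4 OpenMP threads each, two ranks sharing each GPU
# (the -gpu_id string lists one GPU id per PP rank on a node, in order):
mpirun -npernode 4 mdrun_mpi -ntomp 4 -gpu_id 0011 -deffnm md -v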

I'd like to point out one more thing which is important when you run
on more than just a node or two. GPU-accelerated runs don't
automatically switch to using separate PME ranks - mostly because it's
very hard to pick good settings for distributing cores between PP and
PME ranks. However, already from around two to three nodes you will
get better performance by using separate PME ranks.

You should experiment with dedicating part of the cores (usually half
is a decent choice) to PME, e.g. by running 2 PP + 1 PME or 2 PP + 2
PME ranks.
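
As a rough, untested sketch of the 2 PP + 1 PME layout on two of these
nodes (the thread counts are only a starting point for benchmarking):

# 3 ranks per node = 2 PP + 1 PME-only rank; -npme sets the total number
# of PME-only ranks, and the -gpu_id string maps only the PP ranks on a
# node to GPUs. 2 x 6 PP threads + 4 PME threads fill the 16 cores.
mpirun -npernode 3 mdrun_mpi -npme 2 -ntomp 6 -ntomp_pme 4 -gpu_id 01 -deffnm md -v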

 Thank you Mark and Szilárd for your invaluable expertise.

Welcome!

--
Szilárd


 Best regards,
 João Henriques


 On Wed, Jun 5, 2013 at 4:21 PM, João Henriques 
 joao.henriques.32...@gmail.com wrote:

 Ok, thanks once again. I will do my best to overcome this issue.

 Best regards,
 João Henriques


 On Wed, Jun 5, 2013 at 3:33 PM, Mark Abraham mark.j.abra...@gmail.com wrote:

 On Wed, Jun 5, 2013 at 2:53 PM, João Henriques 
 joao.henriques.32...@gmail.com wrote:

  Sorry to keep bugging you guys, but even after considering all you
  suggested and reading the bugzilla thread Mark pointed out, I'm still
  unable to make the simulation run over multiple nodes.
  *Here is a template of a simple submission over 2 nodes:*
 
  --- START ---
  #!/bin/sh
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  #
  # Job name
  #SBATCH -J md
  #
  # No. of nodes and no. of processors per node
  #SBATCH -N 2
  #SBATCH --exclusive
  #
  # Time needed to complete the job
  #SBATCH -t 48:00:00
  #
  # Add modules
  module load gcc/4.6.3
  module load openmpi/1.6.3/gcc/4.6.3
  module load cuda/5.0
  module load gromacs/4.6
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  #
  grompp -f md.mdp -c npt.gro -t npt.cpt -p topol -o md.tpr
  mpirun -np 4 mdrun_mpi -gpu_id 01 -deffnm md -v
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  --- END ---
 
  *Here is an extract of the md.log:*
 
  --- START ---
  Using 4 MPI processes
  Using 4 OpenMP threads per MPI process
 
  Detecting CPU-specific acceleration.
  Present hardware specification:
  Vendor: GenuineIntel
  Brand:  Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  Family:  6  Model: 45  Stepping:  7
  Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr
 nonstop_tsc
  pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2
 ssse3
  tdt x2apic
  Acceleration most likely to fit this hardware: AVX_256
  Acceleration selected at GROMACS compile time: AVX_256
 
 
  2 GPUs detected on host en001:
#0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
#1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
 
 
  ---
  Program mdrun_mpi, VERSION 4.6
  Source code file:
 
 /lunarc/sw/erik/src/gromacs/gromacs-4.6/src/gmxlib/gmx_detect_hardware.c,
  line: 322
 
  Fatal error:
  Incorrect launch configuration: mismatching number of PP MPI processes
 and
  GPUs per node.
 

 "per node" is critical here.


  mdrun_mpi was started with 4 PP MPI processes per node, but you
 provided 2
  GPUs.
 

 ...and here. As far as mdrun_mpi knows from the MPI system, there are
 only MPI ranks on this one node.

 For more information and tips for troubleshooting, please check the
 GROMACS
  website at http://www.gromacs.org/Documentation/Errors
  ---
  --- END ---
 
  As you can see, gmx is having trouble understanding that there's a
 second
  node available. Note that since I did not specify -ntomp, it assigned 4
  threads to each of the 4 mpi processes (filling the entire avail. 16
 CPUs
  *on
  one node*).
  For the same exact submission, if I do set -ntomp 8 (since I have 4
 MPI
  procs * 8 OpenMP threads = 32 CPUs total on the 2 nodes) I get a warning
  telling me that I'm hyperthreading, which can only mean that *gmx is
  assigning all processes to the first node once again.*
  Am I doing something wrong or is there some problem with gmx-4.6? I
 guess
  it can only be my fault, since I've never seen anyone else complaining
  about the same issue here.

Re: [gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

2013-06-05 Thread João Henriques
Thank you very much for both contributions. I will conduct some tests to
assess which approach works best for my system.

Much appreciated,
Best regards,
João Henriques


On Tue, Jun 4, 2013 at 6:30 PM, Szilárd Páll szilard.p...@cbr.su.se wrote:

 mdrun is not blind, it's just that the current design does not report the
 hardware of all compute nodes used. Whatever CPU/GPU hardware mdrun reports in
 the log/std output is *only* what rank 0, i.e. the first MPI process,
 detects. If you have a heterogeneous hardware configuration, in most
 cases you should be able to run just fine, but you'll still get only
 the hardware the first rank sits on reported.

 Hence, if you want to run on 5 of the nodes you mention, you just do:
 mpirun -np 10 mdrun_mpi [-gpu_id 01]

 You may want to try both -ntomp 8 and -ntomp 16 (using HyperThreading
 does not always help).

 Also note that if you use GPU sharing among ranks (in order to use 8
  threads/rank), disabling dynamic load balancing may help (for some
  technical reasons) - especially if you have a homogeneous simulation
  system (and hardware setup).


 Cheers,
 --
 Szilárd


 On Tue, Jun 4, 2013 at 3:31 PM, João Henriques
 joao.henriques.32...@gmail.com wrote:
  Dear all,
 
  Since gmx-4.6 came out, I've been particularly interested in taking
  advantage of the native GPU acceleration for my simulations. Luckily, I
  have access to a cluster with the following specs PER NODE:
 
  CPU
  2 E5-2650 (2.0 GHz, 8-core)
 
  GPU
  2 Nvidia K20
 
  I've become quite familiar with the heterogeneous parallelization and
  multiple MPI ranks per GPU schemes on a SINGLE NODE. Everything works
  fine, no problems at all.
 
  Currently, I'm working with a nasty system comprising 608159 tip3p water
  molecules, so it would really help to speed things up a bit.
  Therefore, I would really like to try to parallelize my system over
  multiple nodes and keep the GPU acceleration.
 
  I've tried many different command combinations, but mdrun seems to be
 blind
  towards the GPUs existing on other nodes. It always finds GPUs #0 and #1
 on
  the first node and tries to fit everything into these, completely
  disregarding the existence of the other GPUs on the remaining requested
  nodes.
 
  Once again, note that all nodes have exactly the same specs.
 
  Literature on the official gmx website is not, well... you know...
 in-depth
  and I would really appreciate if someone could shed some light into this
  subject.
 
  Thank you,
  Best regards,
 
  --
  João Henriques




-- 
João Henriques


Re: [gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

2013-06-05 Thread João Henriques
Sorry to keep bugging you guys, but even after considering all you
suggested and reading the bugzilla thread Mark pointed out, I'm still
unable to make the simulation run over multiple nodes.
*Here is a template of a simple submission over 2 nodes:*

--- START ---
#!/bin/sh
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Job name
#SBATCH -J md
#
# No. of nodes and no. of processors per node
#SBATCH -N 2
#SBATCH --exclusive
#
# Time needed to complete the job
#SBATCH -t 48:00:00
#
# Add modules
module load gcc/4.6.3
module load openmpi/1.6.3/gcc/4.6.3
module load cuda/5.0
module load gromacs/4.6
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
grompp -f md.mdp -c npt.gro -t npt.cpt -p topol -o md.tpr
mpirun -np 4 mdrun_mpi -gpu_id 01 -deffnm md -v
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
--- END ---

*Here is an extract of the md.log:*

--- START ---
Using 4 MPI processes
Using 4 OpenMP threads per MPI process

Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand:  Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
Family:  6  Model: 45  Stepping:  7
Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
tdt x2apic
Acceleration most likely to fit this hardware: AVX_256
Acceleration selected at GROMACS compile time: AVX_256


2 GPUs detected on host en001:
  #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
  #1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible


---
Program mdrun_mpi, VERSION 4.6
Source code file:
/lunarc/sw/erik/src/gromacs/gromacs-4.6/src/gmxlib/gmx_detect_hardware.c,
line: 322

Fatal error:
Incorrect launch configuration: mismatching number of PP MPI processes and
GPUs per node.
mdrun_mpi was started with 4 PP MPI processes per node, but you provided 2
GPUs.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
---
--- END ---

As you can see, gmx is having trouble understanding that there's a second
node available. Note that since I did not specify -ntomp, it assigned 4
threads to each of the 4 mpi processes (filling the entire avail. 16 CPUs *on
one node*).
For the same exact submission, if I do set -ntomp 8 (since I have 4 MPI
procs * 8 OpenMP threads = 32 CPUs total on the 2 nodes) I get a warning
telling me that I'm hyperthreading, which can only mean that *gmx is
assigning all processes to the first node once again.*
Am I doing something wrong or is there some problem with gmx-4.6? I guess
it can only be my fault, since I've never seen anyone else complaining
about the same issue here.

*Here are the cluster spec details:*

http://www.lunarc.lu.se/Systems/ErikDetails

Thank you for your patience and expertise,
Best regards,
João Henriques



On Tue, Jun 4, 2013 at 6:30 PM, Szilárd Páll szilard.p...@cbr.su.se wrote:

 mdrun is not blind, it's just that the current design does not report the
 hardware of all compute nodes used. Whatever CPU/GPU hardware mdrun reports in
 the log/std output is *only* what rank 0, i.e. the first MPI process,
 detects. If you have a heterogeneous hardware configuration, in most
 cases you should be able to run just fine, but you'll still get only
 the hardware the first rank sits on reported.

 Hence, if you want to run on 5 of the nodes you mention, you just do:
 mpirun -np 10 mdrun_mpi [-gpu_id 01]

 You may want to try both -ntomp 8 and -ntomp 16 (using HyperThreading
 does not always help).

 Also note that if you use GPU sharing among ranks (in order to use 8
 threads/rank), disabling dynamic load balancing may help (for some
 technical reasons) - especially if you have a homogeneous simulation
 system (and hardware setup).


 Cheers,
 --
 Szilárd


 On Tue, Jun 4, 2013 at 3:31 PM, João Henriques
 joao.henriques.32...@gmail.com wrote:
  Dear all,
 
  Since gmx-4.6 came out, I've been particularly interested in taking
  advantage of the native GPU acceleration for my simulations. Luckily, I
  have access to a cluster with the following specs PER NODE:
 
  CPU
 2 E5-2650 (2.0 GHz, 8-core)
 
  GPU
  2 Nvidia K20
 
 I've become quite familiar with the heterogeneous parallelization and
  multiple MPI ranks per GPU schemes on a SINGLE NODE. Everything works
  fine, no problems at all.
 
  Currently, I'm working with a nasty system comprising 608159 tip3p water
 molecules, so it would really help to speed things up a bit.
  Therefore, I would really like to try to parallelize my system over
  multiple nodes and keep the GPU acceleration.
 
  I've tried many different command combinations, but mdrun seems to be
 blind
  towards the GPUs existing on other nodes. It always finds GPUs #0 and #1
 on
 the first node and tries to fit everything into these.

Re: [gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

2013-06-05 Thread Mark Abraham
On Wed, Jun 5, 2013 at 2:53 PM, João Henriques 
joao.henriques.32...@gmail.com wrote:

 Sorry to keep bugging you guys, but even after considering all you
 suggested and reading the bugzilla thread Mark pointed out, I'm still
 unable to make the simulation run over multiple nodes.
 *Here is a template of a simple submission over 2 nodes:*

 --- START ---
 #!/bin/sh
 #
 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 #
 # Job name
 #SBATCH -J md
 #
 # No. of nodes and no. of processors per node
 #SBATCH -N 2
 #SBATCH --exclusive
 #
 # Time needed to complete the job
 #SBATCH -t 48:00:00
 #
 # Add modules
 module load gcc/4.6.3
 module load openmpi/1.6.3/gcc/4.6.3
 module load cuda/5.0
 module load gromacs/4.6
 #
 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 #
 grompp -f md.mdp -c npt.gro -t npt.cpt -p topol -o md.tpr
 mpirun -np 4 mdrun_mpi -gpu_id 01 -deffnm md -v
 #
 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 --- END ---

 *Here is an extract of the md.log:*

 --- START ---
 Using 4 MPI processes
 Using 4 OpenMP threads per MPI process

 Detecting CPU-specific acceleration.
 Present hardware specification:
 Vendor: GenuineIntel
 Brand:  Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
 Family:  6  Model: 45  Stepping:  7
 Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
 pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
 tdt x2apic
 Acceleration most likely to fit this hardware: AVX_256
 Acceleration selected at GROMACS compile time: AVX_256


 2 GPUs detected on host en001:
   #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
   #1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible


 ---
 Program mdrun_mpi, VERSION 4.6
 Source code file:
 /lunarc/sw/erik/src/gromacs/gromacs-4.6/src/gmxlib/gmx_detect_hardware.c,
 line: 322

 Fatal error:
 Incorrect launch configuration: mismatching number of PP MPI processes and
 GPUs per node.


"per node" is critical here.


 mdrun_mpi was started with 4 PP MPI processes per node, but you provided 2
 GPUs.


...and here. As far as mdrun_mpi knows from the MPI system, there are only
MPI ranks on this one node.

For more information and tips for troubleshooting, please check the GROMACS
 website at http://www.gromacs.org/Documentation/Errors
 ---
 --- END ---

 As you can see, gmx is having trouble understanding that there's a second
 node available. Note that since I did not specify -ntomp, it assigned 4
 threads to each of the 4 mpi processes (filling the entire avail. 16 CPUs
 *on
 one node*).
 For the same exact submission, if I do set -ntomp 8 (since I have 4 MPI
 procs * 8 OpenMP threads = 32 CPUs total on the 2 nodes) I get a warning
 telling me that I'm hyperthreading, which can only mean that *gmx is
 assigning all processes to the first node once again.*
 Am I doing something wrong or is there some problem with gmx-4.6? I guess
 it can only be my fault, since I've never seen anyone else complaining
 about the same issue here.


Assigning MPI processes to nodes is a matter of configuring your MPI. GROMACS
just follows the MPI system information it gets from MPI - hence the
oversubscription. If you assign two MPI processes to each node, then things
should work.
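
For example, with the Open MPI + SLURM setup in your script, something
along these lines should give two ranks per node (a sketch - the exact
flags differ between MPI stacks and schedulers):

#SBATCH -N 2
#SBATCH --ntasks-per-node=2
...
mpirun -npernode 2 mdrun_mpi -ntomp 8 -gpu_id 01 -deffnm md -v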

Mark


Re: [gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

2013-06-05 Thread João Henriques
Ok, thanks once again. I will do my best to overcome this issue.

Best regards,
João Henriques


On Wed, Jun 5, 2013 at 3:33 PM, Mark Abraham mark.j.abra...@gmail.com wrote:

 On Wed, Jun 5, 2013 at 2:53 PM, João Henriques 
 joao.henriques.32...@gmail.com wrote:

  Sorry to keep bugging you guys, but even after considering all you
  suggested and reading the bugzilla thread Mark pointed out, I'm still
  unable to make the simulation run over multiple nodes.
  *Here is a template of a simple submission over 2 nodes:*
 
  --- START ---
  #!/bin/sh
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  #
  # Job name
  #SBATCH -J md
  #
  # No. of nodes and no. of processors per node
  #SBATCH -N 2
  #SBATCH --exclusive
  #
  # Time needed to complete the job
  #SBATCH -t 48:00:00
  #
  # Add modules
  module load gcc/4.6.3
  module load openmpi/1.6.3/gcc/4.6.3
  module load cuda/5.0
  module load gromacs/4.6
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  #
  grompp -f md.mdp -c npt.gro -t npt.cpt -p topol -o md.tpr
  mpirun -np 4 mdrun_mpi -gpu_id 01 -deffnm md -v
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  --- END ---
 
  *Here is an extract of the md.log:*
 
  --- START ---
  Using 4 MPI processes
  Using 4 OpenMP threads per MPI process
 
  Detecting CPU-specific acceleration.
  Present hardware specification:
  Vendor: GenuineIntel
  Brand:  Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  Family:  6  Model: 45  Stepping:  7
  Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr
 nonstop_tsc
  pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2
 ssse3
  tdt x2apic
  Acceleration most likely to fit this hardware: AVX_256
  Acceleration selected at GROMACS compile time: AVX_256
 
 
  2 GPUs detected on host en001:
#0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
#1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
 
 
  ---
  Program mdrun_mpi, VERSION 4.6
  Source code file:
  /lunarc/sw/erik/src/gromacs/gromacs-4.6/src/gmxlib/gmx_detect_hardware.c,
  line: 322
 
  Fatal error:
  Incorrect launch configuration: mismatching number of PP MPI processes
 and
  GPUs per node.
 

 "per node" is critical here.


  mdrun_mpi was started with 4 PP MPI processes per node, but you provided
 2
  GPUs.
 

 ...and here. As far as mdrun_mpi knows from the MPI system, there are only
 MPI ranks on this one node.

 For more information and tips for troubleshooting, please check the GROMACS
  website at http://www.gromacs.org/Documentation/Errors
  ---
  --- END ---
 
  As you can see, gmx is having trouble understanding that there's a second
  node available. Note that since I did not specify -ntomp, it assigned 4
  threads to each of the 4 mpi processes (filling the entire avail. 16 CPUs
  *on
  one node*).
  For the same exact submission, if I do set -ntomp 8 (since I have 4 MPI
  procs * 8 OpenMP threads = 32 CPUs total on the 2 nodes) I get a warning
  telling me that I'm hyperthreading, which can only mean that *gmx is
  assigning all processes to the first node once again.*
  Am I doing something wrong or is there some problem with gmx-4.6? I guess
  it can only be my fault, since I've never seen anyone else complaining
  about the same issue here.
 

 Assigning MPI processes to nodes is a matter of configuring your MPI. GROMACS
 just follows the MPI system information it gets from MPI - hence the
 oversubscription. If you assign two MPI processes to each node, then things
 should work.

 Mark




-- 
João Henriques


Re: [gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

2013-06-05 Thread João Henriques
Just to wrap up this thread, it does work when the mpirun is properly
configured. I knew it had to be my fault :)

Something like this works like a charm:
mpirun -npernode 2 mdrun_mpi -ntomp 8 -gpu_id 01 -deffnm md -v

Thank you Mark and Szilárd for your invaluable expertise.

Best regards,
João Henriques


On Wed, Jun 5, 2013 at 4:21 PM, João Henriques 
joao.henriques.32...@gmail.com wrote:

 Ok, thanks once again. I will do my best to overcome this issue.

 Best regards,
 João Henriques


 On Wed, Jun 5, 2013 at 3:33 PM, Mark Abraham mark.j.abra...@gmail.com wrote:

 On Wed, Jun 5, 2013 at 2:53 PM, João Henriques 
 joao.henriques.32...@gmail.com wrote:

  Sorry to keep bugging you guys, but even after considering all you
  suggested and reading the bugzilla thread Mark pointed out, I'm still
  unable to make the simulation run over multiple nodes.
  *Here is a template of a simple submission over 2 nodes:*
 
  --- START ---
  #!/bin/sh
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  #
  # Job name
  #SBATCH -J md
  #
  # No. of nodes and no. of processors per node
  #SBATCH -N 2
  #SBATCH --exclusive
  #
  # Time needed to complete the job
  #SBATCH -t 48:00:00
  #
  # Add modules
  module load gcc/4.6.3
  module load openmpi/1.6.3/gcc/4.6.3
  module load cuda/5.0
  module load gromacs/4.6
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  #
  grompp -f md.mdp -c npt.gro -t npt.cpt -p topol -o md.tpr
  mpirun -np 4 mdrun_mpi -gpu_id 01 -deffnm md -v
  #
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  --- END ---
 
  *Here is an extract of the md.log:*
 
  --- START ---
  Using 4 MPI processes
  Using 4 OpenMP threads per MPI process
 
  Detecting CPU-specific acceleration.
  Present hardware specification:
  Vendor: GenuineIntel
  Brand:  Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  Family:  6  Model: 45  Stepping:  7
  Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr
 nonstop_tsc
  pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2
 ssse3
  tdt x2apic
  Acceleration most likely to fit this hardware: AVX_256
  Acceleration selected at GROMACS compile time: AVX_256
 
 
  2 GPUs detected on host en001:
#0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
#1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
 
 
  ---
  Program mdrun_mpi, VERSION 4.6
  Source code file:
 
 /lunarc/sw/erik/src/gromacs/gromacs-4.6/src/gmxlib/gmx_detect_hardware.c,
  line: 322
 
  Fatal error:
  Incorrect launch configuration: mismatching number of PP MPI processes
 and
  GPUs per node.
 

 "per node" is critical here.


  mdrun_mpi was started with 4 PP MPI processes per node, but you
 provided 2
  GPUs.
 

 ...and here. As far as mdrun_mpi knows from the MPI system, there are
 only MPI ranks on this one node.

 For more information and tips for troubleshooting, please check the
 GROMACS
  website at http://www.gromacs.org/Documentation/Errors
  ---
  --- END ---
 
  As you can see, gmx is having trouble understanding that there's a
 second
  node available. Note that since I did not specify -ntomp, it assigned 4
  threads to each of the 4 mpi processes (filling the entire avail. 16
 CPUs
  *on
  one node*).
  For the same exact submission, if I do set -ntomp 8 (since I have 4
 MPI
  procs * 8 OpenMP threads = 32 CPUs total on the 2 nodes) I get a warning
  telling me that I'm hyperthreading, which can only mean that *gmx is
  assigning all processes to the first node once again.*
  Am I doing something wrong or is there some problem with gmx-4.6? I
 guess
  it can only be my fault, since I've never seen anyone else complaining
  about the same issue here.
 

 Assigning MPI processes to nodes is a matter of configuring your MPI. GROMACS
 just follows the MPI system information it gets from MPI - hence the
 oversubscription. If you assign two MPI processes to each node, then
 things
 should work.

 Mark




 --
 João Henriques




-- 
João Henriques


Re: [gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

2013-06-04 Thread Mark Abraham
Yes, the documentation and output are not optimal. Resources are limited, sorry.
Some of these issues are discussed in
http://bugzilla.gromacs.org/issues/1135. The good news is that it sounds
like you are having a non-problem. The output tacitly assumes
homogeneity. If your performance results are linear over small numbers of
nodes, then you're doing fine.
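
A quick, hedged way to check that: run the same md.tpr on 1 and 2 nodes
and compare the ns/day figures mdrun prints at the end of each log (the
*_1node/*_2node names below are just placeholders):

grep "Performance:" md_1node.log md_2node.log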

Mark

On Tue, Jun 4, 2013 at 3:31 PM, João Henriques 
joao.henriques.32...@gmail.com wrote:

 Dear all,

 Since gmx-4.6 came out, I've been particularly interested in taking
 advantage of the native GPU acceleration for my simulations. Luckily, I
 have access to a cluster with the following specs PER NODE:

 CPU
 2 E5-2650 (2.0 GHz, 8-core)

 GPU
 2 Nvidia K20

 I've become quite familiar with the heterogeneous parallelization and
 multiple MPI ranks per GPU schemes on a SINGLE NODE. Everything works
 fine, no problems at all.

 Currently, I'm working with a nasty system comprising 608159 tip3p water
 molecules, so it would really help to speed things up a bit.
 Therefore, I would really like to try to parallelize my system over
 multiple nodes and keep the GPU acceleration.

 I've tried many different command combinations, but mdrun seems to be blind
 towards the GPUs existing on other nodes. It always finds GPUs #0 and #1 on
 the first node and tries to fit everything into these, completely
 disregarding the existence of the other GPUs on the remaining requested
 nodes.

 Once again, note that all nodes have exactly the same specs.

 Literature on the official gmx website is not, well... you know... in-depth
 and I would really appreciate if someone could shed some light into this
 subject.

 Thank you,
 Best regards,

 --
 João Henriques



Re: [gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

2013-06-04 Thread Szilárd Páll
mdrun is not blind, it's just that the current design does not report the
hardware of all compute nodes used. Whatever CPU/GPU hardware mdrun reports in
the log/std output is *only* what rank 0, i.e. the first MPI process,
detects. If you have a heterogeneous hardware configuration, in most
cases you should be able to run just fine, but you'll still get only
the hardware the first rank sits on reported.

Hence, if you want to run on 5 of the nodes you mention, you just do:
mpirun -np 10 mdrun_mpi [-gpu_id 01]

You may want to try both -ntomp 8 and -ntomp 16 (using HyperThreading
does not always help).

Also note that if you use GPU sharing among ranks (in order to use 8
threads/rank), disabling dynamic load balancing may help (for some
technical reasons) - especially if you have a homogeneous simulation
system (and hardware setup).
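
As a hedged example on a 5-node allocation (using Open MPI's -npernode
to fix the per-node rank count; the thread counts are starting points
to benchmark, not recommendations):

# 4 ranks per node sharing the two GPUs (8 threads/rank then uses the
# HyperThreads), with dynamic load balancing disabled:
mpirun -npernode 4 mdrun_mpi -ntomp 8 -gpu_id 0011 -dlb no -deffnm md -v

For the 2-ranks-per-node case, swap in -npernode 2 and -gpu_id 01, and
try both -ntomp 8 and -ntomp 16.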


Cheers,
--
Szilárd


On Tue, Jun 4, 2013 at 3:31 PM, João Henriques
joao.henriques.32...@gmail.com wrote:
 Dear all,

 Since gmx-4.6 came out, I've been particularly interested in taking
 advantage of the native GPU acceleration for my simulations. Luckily, I
 have access to a cluster with the following specs PER NODE:

 CPU
 2 E5-2650 (2.0 GHz, 8-core)

 GPU
 2 Nvidia K20

 I've become quite familiar with the heterogeneous parallelization and
 multiple MPI ranks per GPU schemes on a SINGLE NODE. Everything works
 fine, no problems at all.

 Currently, I'm working with a nasty system comprising 608159 tip3p water
 molecules, so it would really help to speed things up a bit.
 Therefore, I would really like to try to parallelize my system over
 multiple nodes and keep the GPU acceleration.

 I've tried many different command combinations, but mdrun seems to be blind
 towards the GPUs existing on other nodes. It always finds GPUs #0 and #1 on
 the first node and tries to fit everything into these, completely
 disregarding the existence of the other GPUs on the remaining requested
 nodes.

 Once again, note that all nodes have exactly the same specs.

 Literature on the official gmx website is not, well... you know... in-depth
 and I would really appreciate if someone could shed some light into this
 subject.

 Thank you,
 Best regards,

 --
 João Henriques