Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-22 Thread Sander Pronk
Hi Carsten,

I've been thinking a bit about this issue, and for now a relatively easy fix 
would be to enable thread affinity when all cores on a machine are used. When 
fewer threads are used, I don't want to turn on thread affinity, because any 
fixed core assignment might either
- interfere with other running mdruns, or
- cause mdrun to run sub-optimally, for example by forcing two threads onto 
the same core when hyperthreading is enabled.

The second issue *could* be solved, but would require some work that I 
personally feel would be the domain of the operating system. I'm looking into 
using hwloc right now, but that doesn't appear to have cmake support.
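
To make that concrete, here is a rough sketch (not GROMACS code; the function 
name and the all-cores check are my own) of what such conditional, hwloc-based 
per-thread binding could look like:

    #include <hwloc.h>
    #include <stdio.h>

    /* Sketch only: pin the calling thread (number 'rank' of 'nthreads') to
     * its own core, but only when mdrun uses every core on the machine, so
     * that partially loaded nodes are left to the OS scheduler. */
    static void maybe_bind_thread(int rank, int nthreads)
    {
        hwloc_topology_t topo;
        hwloc_obj_t      core;
        int              ncores;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);
        ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);

        if (nthreads == ncores)  /* all cores in use: pinning cannot hurt others */
        {
            core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, rank);
            if (core == NULL ||
                hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD) < 0)
            {
                fprintf(stderr, "Warning: could not bind thread %d\n", rank);
            }
        }
        hwloc_topology_destroy(topo);
    }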

It appears that relatively recent kernels are pretty good at distributing jobs; 
do you know which kernel version and distribution gave you the unreliable 
performance numbers you e-mailed?

Sander

On 21 Oct 2010, at 14:04 , Carsten Kutzner wrote:

 Hi Sander,
 
 On Oct 21, 2010, at 12:27 PM, Sander Pronk wrote:
 
 Hi Carsten,
 
 As Berk noted, we haven't had problems on 24-core machines, but quite 
 frankly I haven't looked at thread migration. 
 I did not have any problems on 32-core machines either, only on 48-core ones.
 
 Currently, the wait states actively yield to the scheduler, which is an 
 opportunity for the scheduler to re-assign threads to different cores. I 
 could set harder thread affinity but that could compromise system 
 responsiveness (when running mdrun on a desktop machine without active 
 yielding, the system slows down noticeably). 
 
 One thing you could try is to turn on the THREAD_MPI_WAIT_FOR_NO_ONE option 
 in cmake. That turns off the yielding which might change the migration 
 behavior.
 I will try that, thanks!
 
 BTW, what do you mean by bad performance, and how do you notice thread 
 migration issues?
 A while ago I benchmarked a ~80,000 atom test system (membrane+channel+water, 
 2 fs time step, cutoffs @ 1 nm) on a 48-core 1.9 GHz AMD node. My first try 
 gave a lousy 7.5 ns/day using Gromacs 4.0.7 and IntelMPI. According to AMD, 
 parallel applications should be run under control of numactl to be compliant 
 with the new memory hierarchy. Also, they suggest using OpenMPI rather than 
 other MPI libs. With OpenMPI and numactl - which pins the processes to the 
 cores - the performance nearly doubled to 14.3 ns/day. Using Gromacs 4.5 I 
 got 14.0 ns/day with OpenMPI+numactl and 15.2 ns/day with threads (here no 
 pinning was necessary for the threaded version!).
 
 Now on another machine with identical hardware (but a different Linux) I get 
 4.5.1 timings that vary a lot (see the g_tune_pme snippet below), even between 
 identical runs. One run actually approaches the expected 15 ns/day, while the 
 others (also with 20 PME-only nodes) do not. I cannot be sure that thread 
 migration is the problem here, but correct pinning might be necessary.
 
 Carsten
 
 
 
 g_tune_pme output snippet for mdrun with threads:
 -
 Benchmark steps : 1000
 dlb equilibration steps : 100
 Repeats for each test   : 4
 
 No.  scaling  rcoulomb  nkx  nky  nkz   spacing  rvdw  tpr file
   0  -input-      1.00   90   88   80  0.119865  1.00  ./Aquaporin_gmx4_bench00.tpr
 
 Individual timings for input file 0 (./Aquaporin_gmx4_bench00.tpr):
 PME nodes   Gcycles   ns/day   PME/f   Remark
        24  1804.442    8.736   1.703   OK.
        24  1805.655    8.730   1.689   OK.
        24  1260.351   12.505   0.647   OK.
        24  1954.314    8.064   1.488   OK.
        20  1753.386    8.992   1.960   OK.
        20  1981.032    7.958   2.190   OK.
        20  1344.375   11.721   1.180   OK.
        20  1103.340   14.287   0.896   OK.
        16  1876.134    8.404   1.713   OK.
        16  1844.111    8.551   1.525   OK.
        16  1757.414    8.972   1.845   OK.
        16  1785.050    8.833   1.208   OK.
         0  1851.645    8.520     -     OK.
         0  1871.955    8.427     -     OK.
         0  1978.357    7.974     -     OK.
         0  1848.515    8.534     -     OK.
   -1( 18)  1926.202    8.182   1.453   OK.
   -1( 18)  1195.456   13.184   0.826   OK.
   -1( 18)  1816.765    8.677   1.853   OK.
   -1( 18)  1218.834   12.931   0.884   OK.
 
 
 
 Sander
 
 On 21 Oct 2010, at 12:03 , Carsten Kutzner wrote:
 
 Hi,
 
 does anyone have experience with AMD's 12-core Magny-Cours
 processors? With 48 cores on a node it is essential that the processes
 are properly pinned to the cores for optimum performance.  Numactl
 can do this, but at the moment I do not get good performance with
 4.5.1 and threads, which still seem to be migrating around.
 
 Carsten
 
 
 

RE: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-21 Thread Berk Hess

Hi,

We haven't observed any problems running with threads on 24-core AMD nodes 
(4x6 cores).

Berk

 From: ckut...@gwdg.de
 Date: Thu, 21 Oct 2010 12:03:00 +0200
 To: gmx-users@gromacs.org
 Subject: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
 
 Hi,
 
 does anyone have experience with AMD's 12-core Magny-Cours
 processors? With 48 cores on a node it is essential that the processes
 are properly pinned to the cores for optimum performance.  Numactl
 can do this, but at the moment I do not get good performance with
 4.5.1 and threads, which still seem to be migrating around.
 
 Carsten
 
 
 --
 Dr. Carsten Kutzner
 Max Planck Institute for Biophysical Chemistry
 Theoretical and Computational Biophysics
 Am Fassberg 11, 37077 Goettingen, Germany
 Tel. +49-551-2012313, Fax: +49-551-2012302
 http://www.mpibpc.mpg.de/home/grubmueller/ihp/ckutzne
 
 
 
 

Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-21 Thread Sander Pronk
Hi Carsten,

As Berk noted, we haven't had problems on 24-core machines, but quite frankly I 
haven't looked at thread migration. 

Currently, the wait states actively yield to the scheduler, which is an 
opportunity for the scheduler to re-assign threads to different cores. I could 
set harder thread affinity but that could compromise system responsiveness 
(when running mdrun on a desktop machine without active yielding, the system 
slows down noticeably). 

One thing you could try is to turn on the THREAD_MPI_WAIT_FOR_NO_ONE option in 
cmake. That turns off the yielding which might change the migration behavior.
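
For illustration only (this is not the thread_mpi source, just a sketch of the 
two waiting strategies that option switches between):

    #include <sched.h>

    /* Waiting for a flag set by another thread: with yielding, the waiter
     * repeatedly hands its time slice back to the scheduler, which keeps the
     * machine responsive but also invites the scheduler to move the thread;
     * without yielding it spins, staying hot on its current core. */
    static void wait_for_flag(volatile int *flag, int yield_while_waiting)
    {
        while (!*flag)
        {
            if (yield_while_waiting)
            {
                sched_yield();
            }
            /* else: busy-wait */
        }
    }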

BTW, what do you mean by bad performance, and how do you notice thread 
migration issues?

Sander

On 21 Oct 2010, at 12:03 , Carsten Kutzner wrote:

 Hi,
 
 does anyone have experience with AMD's 12-core Magny-Cours
 processors? With 48 cores on a node it is essential that the processes
 are properly pinned to the cores for optimum performance.  Numactl
 can do this, but at the moment I do not get good performance with
 4.5.1 and threads, which still seem to be migrating around.
 
 Carsten
 
 
 --
 Dr. Carsten Kutzner
 Max Planck Institute for Biophysical Chemistry
 Theoretical and Computational Biophysics
 Am Fassberg 11, 37077 Goettingen, Germany
 Tel. +49-551-2012313, Fax: +49-551-2012302
 http://www.mpibpc.mpg.de/home/grubmueller/ihp/ckutzne
 
 
 
 


Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-21 Thread Carsten Kutzner
Hi Sander,

On Oct 21, 2010, at 12:27 PM, Sander Pronk wrote:

 Hi Carsten,
 
 As Berk noted, we haven't had problems on 24-core machines, but quite frankly 
 I haven't looked at thread migration. 
I did not have any problems on 32-core machines either, only on 48-core ones.
 
 Currently, the wait states actively yield to the scheduler, which is an 
 opportunity for the scheduler to re-assign threads to different cores. I 
 could set harder thread affinity but that could compromise system 
 responsiveness (when running mdrun on a desktop machine without active 
 yielding, the system slows down noticeably). 
 
 One thing you could try is to turn on the THREAD_MPI_WAIT_FOR_NO_ONE option 
 in cmake. That turns off the yielding which might change the migration 
 behavior.
I will try that, thanks!
 
 BTW, what do you mean by bad performance, and how do you notice thread 
 migration issues?
A while ago I benchmarked a ~80,000 atom test system (membrane+channel+water, 
2 fs time step, cutoffs @ 1 nm) on a 48-core 1.9 GHz AMD node. My first try 
gave a lousy 7.5 ns/day using Gromacs 4.0.7 and IntelMPI. According to AMD, 
parallel applications should be run under control of numactl to be compliant 
with the new memory hierarchy. Also, they suggest using OpenMPI rather than 
other MPI libs. With OpenMPI and numactl - which pins the processes to the 
cores - the performance nearly doubled to 14.3 ns/day. Using Gromacs 4.5 I got 
14.0 ns/day with OpenMPI+numactl and 15.2 ns/day with threads (here no pinning 
was necessary for the threaded version!).

Now on another machine with identical hardware (but a different Linux) I get 
4.5.1 timings that vary a lot (see the g_tune_pme snippet below), even between 
identical runs. One run actually approaches the expected 15 ns/day, while the 
others (also with 20 PME-only nodes) do not. I cannot be sure that thread 
migration is the problem here, but correct pinning might be necessary.

Carsten
 


g_tune_pme output snippet for mdrun with threads:
-
Benchmark steps : 1000
dlb equilibration steps : 100
Repeats for each test   : 4

 No.  scaling  rcoulomb  nkx  nky  nkz   spacing  rvdw  tpr file
   0  -input-      1.00   90   88   80  0.119865  1.00  ./Aquaporin_gmx4_bench00.tpr

Individual timings for input file 0 (./Aquaporin_gmx4_bench00.tpr):
PME nodes   Gcycles   ns/day   PME/f   Remark
       24  1804.442    8.736   1.703   OK.
       24  1805.655    8.730   1.689   OK.
       24  1260.351   12.505   0.647   OK.
       24  1954.314    8.064   1.488   OK.
       20  1753.386    8.992   1.960   OK.
       20  1981.032    7.958   2.190   OK.
       20  1344.375   11.721   1.180   OK.
       20  1103.340   14.287   0.896   OK.
       16  1876.134    8.404   1.713   OK.
       16  1844.111    8.551   1.525   OK.
       16  1757.414    8.972   1.845   OK.
       16  1785.050    8.833   1.208   OK.
        0  1851.645    8.520     -     OK.
        0  1871.955    8.427     -     OK.
        0  1978.357    7.974     -     OK.
        0  1848.515    8.534     -     OK.
  -1( 18)  1926.202    8.182   1.453   OK.
  -1( 18)  1195.456   13.184   0.826   OK.
  -1( 18)  1816.765    8.677   1.853   OK.
  -1( 18)  1218.834   12.931   0.884   OK.



 Sander
 
 On 21 Oct 2010, at 12:03 , Carsten Kutzner wrote:
 
 Hi,
 
 does anyone have experience with AMD's 12-core Magny-Cours
 processors? With 48 cores on a node it is essential that the processes
 are properly pinned to the cores for optimum performance.  Numactl
 can do this, but at the moment I do not get good performance with
 4.5.1 and threads, which still seem to be migrating around.
 
 Carsten
 
 



Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-21 Thread Sander Pronk

Thanks for the information; the OpenMPI recommendation is probably because 
OpenMPI goes to great lengths to avoid process migration. numactl doesn't 
prevent migration as far as I can tell: it controls where memory gets 
allocated on NUMA machines.

For Gromacs the memory-placement setting should of course always be
numactl --localalloc
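
For what it's worth, the same memory policy can be requested from inside a 
program via libnuma (the library that ships with numactl). A sketch, not 
something mdrun currently does:

    #include <numa.h>   /* libnuma; link with -lnuma */

    /* Ask for node-local allocation programmatically. This is the
     * memory-placement half only; it says nothing about where the CPU
     * scheduler may run the threads. */
    static void request_local_alloc(void)
    {
        if (numa_available() != -1)
        {
            numa_set_localalloc();
        }
    }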

As far as thread migration goes: that might also be a kernel issue. If I 
remember correctly, this issue has only recently been addressed in the Linux 
kernel. I'll check whether there's anything we can do specifically for Gromacs.

Sander


On 21 Oct 2010, at 14:04 , Carsten Kutzner wrote:

 Hi Sander,
 
 On Oct 21, 2010, at 12:27 PM, Sander Pronk wrote:
 
 Hi Carsten,
 
 As Berk noted, we haven't had problems on 24-core machines, but quite 
 frankly I haven't looked at thread migration. 
 I did not have any problems on 32-core machines either, only on 48-core ones.
 
 Currently, the wait states actively yield to the scheduler, which is an 
 opportunity for the scheduler to re-assign threads to different cores. I 
 could set harder thread affinity but that could compromise system 
 responsiveness (when running mdrun on a desktop machine without active 
 yielding, the system slows down noticeably). 
 
 One thing you could try is to turn on the THREAD_MPI_WAIT_FOR_NO_ONE option 
 in cmake. That turns off the yielding which might change the migration 
 behavior.
 I will try that, thanks!
 
 BTW, what do you mean by bad performance, and how do you notice thread 
 migration issues?
 A while ago I benchmarked a ~80,000 atom test system (membrane+channel+water, 
 2 fs time step, cutoffs @ 1 nm) on a 48-core 1.9 GHz AMD node. My first try 
 gave a lousy 7.5 ns/day using Gromacs 4.0.7 and IntelMPI. According to AMD, 
 parallel applications should be run under control of numactl to be compliant 
 with the new memory hierarchy. Also, they suggest using OpenMPI rather than 
 other MPI libs. With OpenMPI and numactl - which pins the processes to the 
 cores - the performance nearly doubled to 14.3 ns/day. Using Gromacs 4.5 I 
 got 14.0 ns/day with OpenMPI+numactl and 15.2 ns/day with threads (here no 
 pinning was necessary for the threaded version!).
 
 Now on another machine with identical hardware (but a different Linux) I get 
 4.5.1 timings that vary a lot (see the g_tune_pme snippet below), even between 
 identical runs. One run actually approaches the expected 15 ns/day, while the 
 others (also with 20 PME-only nodes) do not. I cannot be sure that thread 
 migration is the problem here, but correct pinning might be necessary.
 
 Carsten
 
 
 
 g_tune_pme output snippet for mdrun with threads:
 -
 Benchmark steps : 1000
 dlb equilibration steps : 100
 Repeats for each test   : 4
 
 No.  scaling  rcoulomb  nkx  nky  nkz   spacing  rvdw  tpr file
   0  -input-      1.00   90   88   80  0.119865  1.00  ./Aquaporin_gmx4_bench00.tpr
 
 Individual timings for input file 0 (./Aquaporin_gmx4_bench00.tpr):
 PME nodes   Gcycles   ns/day   PME/f   Remark
        24  1804.442    8.736   1.703   OK.
        24  1805.655    8.730   1.689   OK.
        24  1260.351   12.505   0.647   OK.
        24  1954.314    8.064   1.488   OK.
        20  1753.386    8.992   1.960   OK.
        20  1981.032    7.958   2.190   OK.
        20  1344.375   11.721   1.180   OK.
        20  1103.340   14.287   0.896   OK.
        16  1876.134    8.404   1.713   OK.
        16  1844.111    8.551   1.525   OK.
        16  1757.414    8.972   1.845   OK.
        16  1785.050    8.833   1.208   OK.
         0  1851.645    8.520     -     OK.
         0  1871.955    8.427     -     OK.
         0  1978.357    7.974     -     OK.
         0  1848.515    8.534     -     OK.
   -1( 18)  1926.202    8.182   1.453   OK.
   -1( 18)  1195.456   13.184   0.826   OK.
   -1( 18)  1816.765    8.677   1.853   OK.
   -1( 18)  1218.834   12.931   0.884   OK.
 
 
 
 Sander
 
 On 21 Oct 2010, at 12:03 , Carsten Kutzner wrote:
 
 Hi,
 
 does anyone have experience with AMD's 12-core Magny-Cours
 processors? With 48 cores on a node it is essential that the processes
 are properly pinned to the cores for optimum performance.  Numactl
 can do this, but at the moment I do not get good performance with
 4.5.1 and threads, which still seem to be migrating around.
 
 Carsten
 
 
 

Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-21 Thread Carsten Kutzner
On Oct 21, 2010, at 4:44 PM, Sander Pronk wrote:

 
 Thanks for the information; the OpenMPI recommendation is probably because 
 OpenMPI goes to great lengths trying to avoid process migration. The numactl 
 doesn't prevent migration as far as I can tell: it controls where memory gets 
 allocated if it's NUMA. 
My understanding is that processes get pinned to cores with the help of 
the --physcpubind switch to numactl, but please correct me if I am wrong.
 
Carsten


Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-21 Thread Sander Pronk

On 21 Oct 2010, at 16:50 , Carsten Kutzner wrote:

 On Oct 21, 2010, at 4:44 PM, Sander Pronk wrote:
 
 
 Thanks for the information; the OpenMPI recommendation is probably because 
 OpenMPI goes to great lengths trying to avoid process migration. The numactl 
 doesn't prevent migration as far as I can tell: it controls where memory 
 gets allocated if it's NUMA. 
 My understanding is that processes get pinned to cores with the help of 
 the --physcpubind switch to numactl, but please correct me if I am wrong.
 

The documentation isn't clear on that: you bind a process to a set of cores, 
but that wouldn't necessarily mean that threads that belong to that process are 
bound to a specific core. They might be allowed to migrate within the set that 
you give the process.
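
In other words, --physcpubind sets a process-wide affinity mask; to rule out 
migration within that mask, each thread would have to be pinned individually. 
A Linux-specific sketch of what such a per-thread call looks like (the function 
name is mine, and mdrun does not currently do this):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Restrict the calling thread to a single CPU. numactl --physcpubind
     * only restricts the whole process to a set of CPUs; a per-thread call
     * like this is what removes migration within that set. */
    static int pin_calling_thread(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }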




Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-21 Thread Esztermann, Ansgar

 Thanks for the information; the OpenMPI recommendation is probably because 
 OpenMPI goes to great lengths trying to avoid process migration. The 
 numactl doesn't prevent migration as far as I can tell: it controls where 
 memory gets allocated if it's NUMA. 
 My understanding is that processes get pinned to cores with the help of 
 the --physcpubind switch to numactl, but please correct me if I am wrong.
 
 
 The documentation isn't clear on that: you bind a process to a set of cores, 
 but that wouldn't necessarily mean that threads that belong to that process 
 are bound to a specific core. They might be allowed to migrate within the set 
 that you give the process.

I've just checked a short (a few seconds) benchmark run -- I didn't observe any 
migrations with or without numactl.


A.

-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105



Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

2010-10-21 Thread Ondrej Marsalek
Hi,

FWIW, I have recently asked about this in the hwloc mailing list:

http://www.open-mpi.org/community/lists/hwloc-users/2010/10/0232.php

In general, hwloc is a useful tool for these things.

http://www.open-mpi.org/projects/hwloc/

Best,
Ondrej


On Thu, Oct 21, 2010 at 12:03, Carsten Kutzner ckut...@gwdg.de wrote:
 Hi,

 does anyone have experience with AMD's 12-core Magny-Cours
 processors? With 48 cores on a node it is essential that the processes
 are properly pinned to the cores for optimum performance.  Numactl
 can do this, but at the moment I do not get good performance with
 4.5.1 and threads, which still seem to be migrating around.

 Carsten


 --
 Dr. Carsten Kutzner
 Max Planck Institute for Biophysical Chemistry
 Theoretical and Computational Biophysics
 Am Fassberg 11, 37077 Goettingen, Germany
 Tel. +49-551-2012313, Fax: +49-551-2012302
 http://www.mpibpc.mpg.de/home/grubmueller/ihp/ckutzne




