Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
Hi Carsten,

I've been thinking a bit about this issue, and for now a relatively easy fix would be to enable thread affinity when all cores on a machine are used. When fewer threads are started I don't want to turn on thread affinity, because any combination might either
- interfere with other running mdruns, or
- cause mdrun to run sub-optimally, for example by forcing two threads onto the same core on a hyperthreading machine.
The second issue *could* be solved, but would require some work that I personally feel is the domain of the operating system.

I'm looking into using hwloc right now, but that doesn't appear to have cmake support. It appears that relatively recent kernels are pretty good at distributing jobs; do you know which kernel version and distribution gave you the unreliable performance numbers you e-mailed?

Sander

On 21 Oct 2010, at 14:04, Carsten Kutzner wrote:
> Now on another machine with identical hardware (but another Linux) I get 4.5.1
> timings that vary a lot, even between identical runs. One run actually approaches
> the expected 15 ns/day, while the others (also with 20 PME-only nodes) do not.
> I cannot be sure that thread migration is the problem here, but correct pinning
> might be necessary.
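Until something like that is built into mdrun itself, the same policy can be approximated from a launch script. The sketch below is only an illustration, not part of Gromacs: it restricts the whole process (not each individual thread) to the node's cores, and only when every core is requested; the file names and core numbering are assumptions.

    #!/bin/sh
    # run-mdrun.sh -- hypothetical wrapper: restrict mdrun to all cores of the
    # node only when it actually uses all of them; otherwise leave placement
    # to the scheduler.
    NCORES=$(grep -c '^processor' /proc/cpuinfo)
    NT=${1:-$NCORES}
    if [ "$NT" -eq "$NCORES" ]; then
        exec numactl --physcpubind=0-$((NCORES - 1)) --localalloc \
            mdrun -nt "$NT" -s topol.tpr
    else
        exec mdrun -nt "$NT" -s topol.tpr
    fi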
RE: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
Hi,

We haven't observed any problems running with threads on 24-core AMD nodes (4x6 cores).

Berk

From: ckut...@gwdg.de
Date: Thu, 21 Oct 2010 12:03:00 +0200
To: gmx-users@gromacs.org
Subject: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

> Hi,
> does anyone have experience with AMD's 12-core Magny-Cours processors? With 48
> cores on a node it is essential that the processes are properly pinned to the
> cores for optimum performance. Numactl can do this, but at the moment I do not
> get good performance with 4.5.1 and threads, which still seem to be migrating
> around.
>
> Carsten
>
> --
> Dr. Carsten Kutzner
> Max Planck Institute for Biophysical Chemistry
> Theoretical and Computational Biophysics
> Am Fassberg 11, 37077 Goettingen, Germany
> Tel. +49-551-2012313, Fax: +49-551-2012302
> http://www.mpibpc.mpg.de/home/grubmueller/ihp/ckutzne
Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
Hi Carsten,

As Berk noted, we haven't had problems on 24-core machines, but quite frankly I haven't looked at thread migration.

Currently, the wait states actively yield to the scheduler, which is an opportunity for the scheduler to re-assign threads to different cores. I could set harder thread affinity, but that could compromise system responsiveness (when running mdrun on a desktop machine without active yielding, the system slows down noticeably).

One thing you could try is to turn on the THREAD_MPI_WAIT_FOR_NO_ONE option in cmake. That turns off the yielding, which might change the migration behavior.

By the way, what do you mean by bad performance, and how do you notice thread migration issues?

Sander

On 21 Oct 2010, at 12:03, Carsten Kutzner wrote:
> Hi,
> does anyone have experience with AMD's 12-core Magny-Cours processors? With 48
> cores on a node it is essential that the processes are properly pinned to the
> cores for optimum performance. Numactl can do this, but at the moment I do not
> get good performance with 4.5.1 and threads, which still seem to be migrating
> around.
>
> Carsten
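If you build Gromacs from source, this can be set at configure time. The exact form below is an assumption (treating it as a boolean cmake cache variable); check CMakeCache.txt for the variable actually exposed by your source tree:

    cd gromacs-4.5.1/build
    cmake .. -DTHREAD_MPI_WAIT_FOR_NO_ONE=ON
    make

Note that turning off the yielding effectively makes waiting threads busy-wait, so it is only advisable on a node dedicated to the mdrun job.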
Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
Hi Sander,

On Oct 21, 2010, at 12:27 PM, Sander Pronk wrote:
> Hi Carsten,
> As Berk noted, we haven't had problems on 24-core machines, but quite frankly I
> haven't looked at thread migration.

I did not have any problems on 32-core machines either, only on 48-core ones.

> Currently, the wait states actively yield to the scheduler, which is an opportunity
> for the scheduler to re-assign threads to different cores. I could set harder thread
> affinity, but that could compromise system responsiveness (when running mdrun on a
> desktop machine without active yielding, the system slows down noticeably).
> One thing you could try is to turn on the THREAD_MPI_WAIT_FOR_NO_ONE option in cmake.
> That turns off the yielding, which might change the migration behavior.

I will try that, thanks!

> By the way, what do you mean by bad performance, and how do you notice thread
> migration issues?

A while ago I benchmarked a ~80,000 atom test system (membrane + channel + water, 2 fs time step, cut-offs at 1 nm) on a 48-core 1.9 GHz AMD node. My first try gave a lousy 7.5 ns/day using Gromacs 4.0.7 and IntelMPI. According to AMD, parallel applications should be run under control of numactl to comply with the new memory hierarchy; they also suggest using OpenMPI rather than other MPI libraries. With OpenMPI and numactl - which pins the processes to the cores - the performance nearly doubled, to 14.3 ns/day.

Using Gromacs 4.5 I got 14.0 ns/day with OpenMPI + numactl and 15.2 ns/day with threads (here no pinning was necessary for the threaded version!). Now on another machine with identical hardware (but another Linux) I get 4.5.1 timings that vary a lot (see the g_tune_pme snippet below), even between identical runs. One run actually approaches the expected 15 ns/day, while the others (also with 20 PME-only nodes) do not. I cannot be sure that thread migration is the problem here, but correct pinning might be necessary.

Carsten

g_tune_pme output snippet for mdrun with threads:

  Benchmark steps         : 1000
  dlb equilibration steps : 100
  Repeats for each test   : 4

  No.  scaling  rcoulomb  nkx  nky  nkz  spacing   rvdw  tpr file
    0  -input-  1.00       90   88   80  0.119865  1.00  ./Aquaporin_gmx4_bench00.tpr

  Individual timings for input file 0 (./Aquaporin_gmx4_bench00.tpr):

  PME nodes    Gcycles   ns/day   PME/f   Remark
         24   1804.442    8.736   1.703   OK.
         24   1805.655    8.730   1.689   OK.
         24   1260.351   12.505   0.647   OK.
         24   1954.314    8.064   1.488   OK.
         20   1753.386    8.992   1.960   OK.
         20   1981.032    7.958   2.190   OK.
         20   1344.375   11.721   1.180   OK.
         20   1103.340   14.287   0.896   OK.
         16   1876.134    8.404   1.713   OK.
         16   1844.111    8.551   1.525   OK.
         16   1757.414    8.972   1.845   OK.
         16   1785.050    8.833   1.208   OK.
          0   1851.645    8.520       -   OK.
          0   1871.955    8.427       -   OK.
          0   1978.357    7.974       -   OK.
          0   1848.515    8.534       -   OK.
    -1( 18)   1926.202    8.182   1.453   OK.
    -1( 18)   1195.456   13.184   0.826   OK.
    -1( 18)   1816.765    8.677   1.853   OK.
    -1( 18)   1218.834   12.931   0.884   OK.
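For reference, the two setups compared above correspond to launch commands roughly like the ones below; the binary names, the -npme value, and the tpr file name are only placeholders, not the exact commands used:

    # Gromacs built against OpenMPI, memory kept on the local NUMA node per rank:
    mpirun -np 48 numactl --localalloc mdrun_mpi -npme 20 -s Aquaporin_gmx4_bench00.tpr
    # thread-MPI build of Gromacs 4.5, no external pinning:
    mdrun -nt 48 -npme 20 -s Aquaporin_gmx4_bench00.tpr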
Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
Thanks for the information; the OpenMPI recommendation is probably because OpenMPI goes to great lengths to avoid process migration. As far as I can tell, numactl doesn't prevent migration: on a NUMA machine it controls where memory gets allocated. For Gromacs the setting should of course always be numactl --localalloc.

As far as thread migration goes: that might also be a kernel issue. If I remember correctly, this has only recently been addressed in the Linux kernel. I'll check whether there's anything we can do specifically for Gromacs.

Sander

On 21 Oct 2010, at 14:04, Carsten Kutzner wrote:
> According to AMD, parallel applications should be run under control of numactl to
> comply with the new memory hierarchy; they also suggest using OpenMPI rather than
> other MPI libraries. With OpenMPI and numactl - which pins the processes to the
> cores - the performance nearly doubled, to 14.3 ns/day.
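To make the distinction concrete: the memory-policy part of numactl is the first line below, and the machine's NUMA layout and the policy in effect can be inspected with the other two commands (the mdrun arguments are just placeholders):

    # keep each allocation on the NUMA node of the thread that allocates it:
    numactl --localalloc mdrun -nt 48 -s topol.tpr
    # inspect the NUMA topology and the current policy:
    numactl --hardware
    numactl --show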
Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
On Oct 21, 2010, at 4:44 PM, Sander Pronk wrote:
> Thanks for the information; the OpenMPI recommendation is probably because OpenMPI
> goes to great lengths to avoid process migration. As far as I can tell, numactl
> doesn't prevent migration: on a NUMA machine it controls where memory gets allocated.

My understanding is that processes get pinned to cores with the help of the --physcpubind switch to numactl, but please correct me if I am wrong.

Carsten
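A typical invocation of that switch on one 48-core node (the core list and file name are just an example) would be:

    # restrict the whole mdrun process to cores 0-47 and keep memory local:
    numactl --physcpubind=0-47 --localalloc mdrun -nt 48 -s topol.tpr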
Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
On 21 Oct 2010, at 16:50, Carsten Kutzner wrote:
> My understanding is that processes get pinned to cores with the help of the
> --physcpubind switch to numactl, but please correct me if I am wrong.

The documentation isn't clear on that: you bind a process to a set of cores, but that wouldn't necessarily mean that the threads belonging to that process are bound to a specific core. They might be allowed to migrate within the set that you give the process.
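For a run with one single-threaded MPI rank per core this can be worked around by giving every rank its own one-core set, for example with a small wrapper script. The environment variable below is what recent OpenMPI versions export for the rank's index on the node, and the script name is made up; recent OpenMPI can also bind directly with mpirun --bind-to-core. For the thread-MPI mdrun this does not apply, since all threads live in one process.

    #!/bin/sh
    # pin-rank.sh -- bind each local MPI rank to its own core
    RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    exec numactl --physcpubind="$RANK" --localalloc "$@"

    # launched as (after chmod +x pin-rank.sh):
    mpirun -np 48 ./pin-rank.sh mdrun_mpi -s topol.tpr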
Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
> The documentation isn't clear on that: you bind a process to a set of cores, but
> that wouldn't necessarily mean that the threads belonging to that process are bound
> to a specific core. They might be allowed to migrate within the set that you give
> the process.

I've just checked a short (a few seconds) benchmark run -- I didn't observe any migrations, with or without numactl.

A.

--
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
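For anyone who wants to repeat that check on their own node: the core each mdrun thread last ran on shows up in the PSR column, for example with the command below (repeat it a few times, or run it under watch, to spot migrations):

    # one line per thread of all running mdrun processes; PSR is the core it last ran on
    ps -L -C mdrun -o pid,tid,psr,pcpu,comm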
Re: [gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs
Hi,

FWIW, I have recently asked about this on the hwloc mailing list:
http://www.open-mpi.org/community/lists/hwloc-users/2010/10/0232.php

In general, hwloc is a useful tool for these things:
http://www.open-mpi.org/projects/hwloc/

Best,
Ondrej

On Thu, Oct 21, 2010 at 12:03, Carsten Kutzner ckut...@gwdg.de wrote:
> Hi,
> does anyone have experience with AMD's 12-core Magny-Cours processors? With 48
> cores on a node it is essential that the processes are properly pinned to the
> cores for optimum performance. Numactl can do this, but at the moment I do not
> get good performance with 4.5.1 and threads, which still seem to be migrating
> around.
>
> Carsten
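Two of the hwloc command-line utilities that are handy in this context (assuming the hwloc tools are installed; the mdrun arguments are just an example):

    # show the node's topology: sockets, NUMA nodes, caches, cores
    lstopo
    # run mdrun bound to the cores of the first socket, for example
    hwloc-bind socket:0 -- mdrun -nt 12 -s topol.tpr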