Re: [gmx-users] Scaling/performance on Gromacs 4

2012-06-06 Thread Manu Vajpai
Apologies for reviving such an old thread. For clarification, Interlagos
and Bulldozer both have a modular architecture, as mentioned earlier. Each
Bulldozer module has two integer cores and one floating-point unit shared
between them. So, although the OS reports 64 cores (counting integer
cores), the number of floating-point units is still 32. Moreover, each FP
unit can process two threads where possible, but since GROMACS is so
compute-intensive, I am guessing it is saturated by just one. Hence you do
not observe a scale-up when moving from 32 to 64 threads.
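
For reference, here is one quick way to check what the OS actually exposes on
such a node. This is only a sanity-check sketch; the exact lscpu output depends
on the kernel and util-linux versions installed:

  lscpu | egrep 'Socket|Core|Thread|^CPU\(s\)'   # sockets, cores per socket, threads per core
  grep -c '^processor' /proc/cpuinfo             # logical CPUs the OS sees (64 here)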

Regards,
Manu Vajpai
IIT Kanpur


Re: [gmx-users] Scaling/performance on Gromacs 4

2012-03-16 Thread Szilárd Páll
Hi Sara,

The bad performance you are seeing is most probably caused by the
combination of the new AMD Interlagos CPUs, the compiler, and the operating
system, and it is very likely that the old Gromacs version also
contributes.

In practice these new CPUs don't perform as well as expected, but that
is partly because compilers and operating systems don't yet have full
support for the new architecture. However, based on the quite
extensive benchmarking I've done, the performance with such a large
system should be considerably better than what your numbers show.

This is what you should try:
- compile Gromacs with gcc 4.6 using the -march=bdver1 optimization flag (see the sketch below);
- use at least a 3.0, or preferably newer, Linux kernel;
- if you're not required to use 4.0.x, use 4.5.
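
A minimal sketch of the first point, assuming the autoconf build of a 4.5.x
release and that the gcc 4.6 binary is installed as gcc-4.6 (adjust names and
paths to your system):

  ./configure CC=gcc-4.6 CFLAGS="-O3 -march=bdver1" --enable-mpi   # --enable-mpi only if you need the MPI build
  make -j 8 && make install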

Note that you have to be careful about drawing conclusions from
benchmarking large systems on a small number of cores; you will get
artifacts from caching effects.


And now a bit of fairly technical explanation; for more details, ask Google ;)

The machine you are using has AMD Interlagos CPUs based on the
Bulldozer micro-architecture. This is a new architecture, a departure
from previous AMD processors and in fact quite different from most
current CPUs. Bulldozer cores are not traditional physical cores:
the basic hardware unit is the module, which consists of two
"half cores" (at least when it comes to floating-point units) and
enables a special type of multithreading called clustered
multithreading. This is somewhat similar to Intel cores with
Hyper-Threading.


Cheers,
--
Szilárd



[gmx-users] Scaling/performance on Gromacs 4

2012-02-20 Thread Sara Campos
Dear GROMACS users

My group has had access to a quad-processor, 64-core machine (4 x Opteron
6274 @ 2.2 GHz, 16 cores each), and I made some performance tests using the
following specifications:

System size: 299787 atoms
Number of MD steps: 1500
Electrostatics treatment: PME
Gromacs version: 4.0.4
MPI: LAM
Command run: mpirun -ssi rpi tcp C mdrun_mpi ...

#CPUS          Time (s)   Steps/s
64              195.000     7.69
32              192.000     7.81
16              275.000     5.45
8               381.000     3.94
4               751.000     2.00
2              1001.000     1.50
1              2352.000     0.64

The scaling is not good, but the weirdest part is that 64 processors perform
the same as 32. I have seen the plots from Dr. Hess in the GROMACS 4 paper in
JCTC, and I do not understand why this is happening. Can anyone help?
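
To quantify the scaling, here is a rough sketch assuming the table above is
saved in a (hypothetical) file named timings.txt, using the 1-core time of
2352 s as the reference:

  awk 'NR>1 { printf "%3d cores: speedup %5.2f, efficiency %4.0f%%\n", $1, 2352/$2, 100*2352/($1*$2) }' timings.txt

For example, 32 cores give a speedup of about 12.3, i.e. roughly 38% parallel efficiency.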

Thanks in advance,
Sara

Re: [gmx-users] Scaling/performance on Gromacs 4

2012-02-20 Thread Carsten Kutzner
Hi Sara,

my guess is that 1500 steps are not at all sufficient for a benchmark on 64
cores. The dynamic load balancing needs more time than that to adapt the
domain sizes for optimal balance. It is also important that you reset the
timers once the load is balanced (to get clean performance numbers); you
might want to use the -resethway switch for that. g_tune_pme will help you
find the performance optimum on any number of nodes; it is included in
Gromacs from 4.5 on.
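
A minimal sketch of both suggestions, assuming GROMACS 4.5 and a placeholder
input file topol.tpr (exact option spellings may differ between versions):

  mpirun -np 64 mdrun_mpi -s topol.tpr -resethway   # reset the cycle counters halfway through the run
  g_tune_pme -np 64 -s topol.tpr                    # scan PME/PP rank splits for the optimum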

Carsten


Re: [gmx-users] Scaling/performance on Gromacs 4

2012-02-20 Thread Floris Buelens
Poor scaling with MPI on many-core machines can also be due to uneven job 
distribution across cores, or to jobs being wastefully swapped between cores. 
You might be able to fix this with some esoteric configuration options of 
mpirun (--bind-to-core worked for me with OpenMPI), but the surest option is 
to switch to Gromacs 4.5 and run using thread-level parallelisation, bypassing 
MPI entirely.
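
A minimal sketch of the two options, assuming OpenMPI and GROMACS 4.5; topol.tpr
is a placeholder and the binding option spelling depends on the OpenMPI version:

  mpirun --bind-to-core -np 64 mdrun_mpi -s topol.tpr   # pin one MPI rank per core
  mdrun -nt 64 -s topol.tpr                             # built-in thread parallelisation, no mpirun needed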




Re: [gmx-users] Scaling/performance on Gromacs 4

2012-02-20 Thread Mark Abraham

On 21/02/2012 8:11 AM, Floris Buelens wrote:
Poor scaling with MPI on many-core machines can also be due to uneven job 
distribution across cores, or to jobs being wastefully swapped between 
cores. You might be able to fix this with some esoteric configuration 
options of mpirun (--bind-to-core worked for me with OpenMPI), but the 
surest option is to switch to Gromacs 4.5 and run using thread-level 
parallelisation, bypassing MPI entirely.


That can avoid problems arising from MPI performance, but not those 
arising from PP-vs-PME load balance or intra-PP load balance. The end 
of the .log files will indicate whether these latter effects are strong 
contributors. Carsten's suggestion is a good one.
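
As a rough sketch of where to look, assuming an output log named md.log (the
exact wording of these statistics differs between GROMACS versions):

  tail -n 80 md.log | egrep -i 'load imbalance|pme mesh/force'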


Mark




