On Tue, 31 Mar 2015 21:56:10 +0200 Jack HALE <[email protected]> wrote:
> I ran into exactly this problem last week with OpenBLAS and threads
> alongside PETSc using MPI.
>
> We have a new machine with 8 sockets, each with a 15-core Ivy Bridge
> Xeon, totalling 120 cores and 240 hyperthreads. The architecture is
> shared-memory NUMA with a total of 3 TB of RAM. By default each MPI
> process (one per physical core) was launching 240 BLAS threads. As you
> can imagine, the results weren't pretty. We were seeing slowdowns on
> the order of 10-15 times on a machine this size.
>
> I have a set of scripts here to compile FEniCS from scratch;
> specifically, OpenBLAS needs to be compiled with the USE_THREAD=0
> flag:
>
> https://bitbucket.org/unilucompmech/fenics-gaia-cluster/src/4c1053d825026d253972dc49f613a2935405862a/build-openblas.sh?at=master
>
> Additionally, it is important to properly bind MPI processes to
> sockets (e.g. MPI processes can float between cores on a socket, but
> not across sockets) and also to map processes across cores first, so
> that MPI processes that communicate with each other most share the
> fastest memory (cache, hopefully!).
>
> This can be achieved using the OpenMPI arguments:
>
> mpirun --report-bindings --bind-to socket --map-by core <your usual
> arguments>

Interesting. Is it a big deal on clusters?

> Finally, on a shared memory system you should really be using the
> vader backend, ideally in conjunction with the xpmem kernel module,
> or, if you are on a default kernel newer than 3.2, the cma module. We
> are limited to the latter because our HPC is based on Debian Wheezy.
> Again, you need to adjust the compile options for OpenMPI:
>
> https://bitbucket.org/unilucompmech/fenics-gaia-cluster/src/4c1053d825026d253972dc49f613a2935405862a/build-openmpi.sh?at=master
>
> I also had to patch OpenMPI to get this to work, but it should be
> fixed in the next version.

If I understand it correctly,

  src/openmpi> ./configure --with-cma

(or the vader+xpmem variant) should not be used on clusters but is
beneficial on shared memory systems (like laptops and workstations),
right?

Jan

> Finally, a more detailed description of all this is here:
>
> https://bitbucket.org/unilucompmech/fenics-gaia-cluster/
>
> Hope this helps people out a bit. All of this will be more relevant to
> end users now that we are seeing desktop machines with 8 cores on
> multiple sockets.
>
> Cheers,
> -----
> Dr. Jack S. Hale
> University of Luxembourg
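For anyone wanting to reproduce the OpenBLAS part of the fix described
above, a minimal sketch of the build step might look like the
following. The version number and install prefix are placeholders, and
the environment-variable alternative at the end is a standard OpenBLAS
and OpenMP runtime knob rather than something taken from the linked
build script:

  # Build OpenBLAS without its own threading (USE_THREAD=0, as above);
  # version and prefix are placeholders
  tar xzf OpenBLAS-0.2.14.tar.gz
  cd OpenBLAS-0.2.14
  make USE_THREAD=0
  make PREFIX=$HOME/opt/openblas install

  # Alternative if rebuilding is not an option: pin each MPI rank to a
  # single BLAS thread at run time
  export OPENBLAS_NUM_THREADS=1
  export OMP_NUM_THREADS=1

The build-time route is the more robust of the two, since it removes
the thread pool entirely instead of relying on every launcher and batch
script to export the right variables.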
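Similarly, a sketch of the process binding and the shared-memory
transport configuration discussed above; the application name, process
count, Open MPI version and the --with-xpmem spelling are illustrative
guesses for an Open MPI 1.8-era setup, not taken from the linked
scripts:

  # Bind each rank to a socket and fill cores in order, as described
  # above ("./my_fenics_app" and -np 120 are placeholders)
  mpirun --report-bindings --bind-to socket --map-by core \
         -np 120 ./my_fenics_app

  # Build Open MPI with CMA support for the vader BTL (kernel >= 3.2);
  # --with-xpmem=<path> would be the xpmem variant
  cd openmpi-1.8.x
  ./configure --prefix=$HOME/opt/openmpi --with-cma
  make -j && make install

The --report-bindings flag is worth keeping in job scripts: it prints
the actual core/socket mapping at startup, which makes it easy to spot
ranks that have floated across sockets.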
