On Tue, 31 Mar 2015 21:56:10 +0200 Jack HALE <[email protected]> wrote:
> I ran into exactly this problem last week with OpenBLAS and threads
> alongside PETSc using MPI.
>
> We have a new machine with 8 sockets, each with a 15-core Ivy Bridge
> Xeon, totalling 120 cores and 240 hyperthreads. The architecture is
> shared-memory NUMA with a total of 3 TB of RAM. By default each MPI
> process (one per physical core) was launching 240 BLAS threads. As you
> can imagine, the results weren't pretty. We were seeing slowdowns on
> the order of 10-15 times on a machine this size.
>
> I have a set of scripts here to compile FEniCS from scratch;
> specifically, OpenBLAS needs to be compiled with the USE_THREAD=0
> flag:
>
> https://bitbucket.org/unilucompmech/fenics-gaia-cluster/src/4c1053d825026d253972dc49f613a2935405862a/build-openblas.sh?at=master
>
> Additionally, it is important to properly bind MPI processes to
> sockets (e.g. MPI processes can float between cores on a socket, but
> not across sockets) and also to map processes across cores first, so
> that MPI processes that communicate with each other most share the
> fastest memory (cache, hopefully!).
>
> This can be achieved using the OpenMPI arguments:
>
> mpirun --report-bindings --bind-to socket --map-by core <your usual
> arguments>

Interesting. Is it a big deal on clusters?

> Finally, on a shared memory system you should really be using the
> vader backend, ideally in conjunction with the xpmem kernel module,
> or, if you are on a default kernel newer than 3.2, the cma module. We
> are limited to the latter because our HPC is based on Debian Wheezy.
> Again, you need to adjust the compile options for OpenMPI:
>
> https://bitbucket.org/unilucompmech/fenics-gaia-cluster/src/4c1053d825026d253972dc49f613a2935405862a/build-openmpi.sh?at=master
>
> I also had to patch OpenMPI to get this to work, but it should be
> fixed in the next version.

If I understand it correctly,

  src/openmpi> ./configure --with-cma

(or the vader+xpmem variant) should not be used on clusters but is
beneficial on shared memory systems (like laptops and workstations),
right?

Jan

> Finally, a more detailed description of all this is here:
>
> https://bitbucket.org/unilucompmech/fenics-gaia-cluster/
>
> Hope this helps people out a bit. All of this will be more relevant to
> end users now that we are seeing desktop machines with 8 cores on
> multiple sockets.
>
> Cheers,
> -----
> Dr. Jack S. Hale
> University of Luxembourg
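For anyone wanting to reproduce the OpenBLAS part of the fix described
above, a minimal sketch of the build step might look like the
following. The version number and install prefix are placeholders, and
the environment-variable alternative at the end is a standard OpenBLAS
and OpenMP runtime knob rather than something taken from the linked
build script:

  # Build OpenBLAS without its own threading (USE_THREAD=0, as above);
  # version and prefix are placeholders
  tar xzf OpenBLAS-0.2.14.tar.gz
  cd OpenBLAS-0.2.14
  make USE_THREAD=0
  make PREFIX=$HOME/opt/openblas install

  # Alternative if rebuilding is not an option: pin each MPI rank to a
  # single BLAS thread at run time
  export OPENBLAS_NUM_THREADS=1
  export OMP_NUM_THREADS=1

The build-time route is the more robust of the two, since it removes
the thread pool entirely instead of relying on every launcher and batch
script to export the right variables.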
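Similarly, a sketch of the process binding and the shared-memory
transport configuration discussed above; the application name, process
count, Open MPI version and the --with-xpmem spelling are illustrative
guesses for an Open MPI 1.8-era setup, not taken from the linked
scripts:

  # Bind each rank to a socket and fill cores in order, as described
  # above ("./my_fenics_app" and -np 120 are placeholders)
  mpirun --report-bindings --bind-to socket --map-by core \
         -np 120 ./my_fenics_app

  # Build Open MPI with CMA support for the vader BTL (kernel >= 3.2);
  # --with-xpmem=<path> would be the xpmem variant
  cd openmpi-1.8.x
  ./configure --prefix=$HOME/opt/openmpi --with-cma
  make -j && make install

The --report-bindings flag is worth keeping in job scripts: it prints
the actual core/socket mapping at startup, which makes it easy to spot
ranks that have floated across sockets.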
