As an aside, with Slurm you can use: sbatch --ntasks-per-socket=<N>
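For illustration, a minimal batch script using that option might look like
the following. This is a sketch, not from the thread: it assumes a
two-socket node, Slurm's task/affinity plugin, and made-up script and
binary names.

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --ntasks-per-socket=2   # place at most 2 tasks on each socket

    # Launch through Slurm so the per-socket task layout is honored.
    srun ./app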
I would hazard a guess that this uses the Open MPI syntax as above to
perform the binding to core!

On 27 July 2015 at 09:47, Ralph Castain <r...@open-mpi.org> wrote:

> As you say, it all depends on your kernel :-)
>
> If the numactl libraries are available, we will explicitly set the
> memory policy to follow the bindings we apply. So doing as I suggested
> will cause the first process to have its memory "bound" to the first
> socket, even though the process will also be using a core from the
> other region. If your process spawns a few threads to ensure that core
> exercises the memory, you'll get plenty of cross-NUMA behavior to test
> against.
>
> Which is why we recommend that users "don't do that" :-)
>
>> On Jul 27, 2015, at 1:25 AM, Davide Cesari <dces...@arpa.emr.it> wrote:
>>
>> Hi Bill and Ralph,
>>
>> The Linux kernel does its best to allocate memory on the local NUMA
>> node when memory is available there, so it is difficult to convince it
>> to do something harmful in this sense. One way to test such a
>> situation would be to start MPI processes on a node in the usual way
>> (so that they are bound to a socket or a core), wait for them to
>> allocate their working memory, and then either migrate the processes
>> to the other NUMA node (usually == socket) or migrate their memory
>> pages. The command-line tools distributed with the numactl package
>> (numactl or migratepages) should let you perform such vandalism; this
>> would put your system into a worst-case scenario from the NUMA point
>> of view.
>>
>> On our system I have in the past noticed strong NUMA-related slowdowns
>> in parallel jobs when a single MPI process doing much more I/O than
>> the others filled all of the local memory with disk cache; the
>> processes on that NUMA node were then forced to allocate memory on the
>> other NUMA node rather than reclaim cache memory on the local node. I
>> solved this in a brutal way by regularly dropping the disk cache on
>> the compute nodes. In my view this is the only case where a (recent)
>> Linux kernel does not behave in a NUMA-aware way; I wonder whether
>> there are HPC-optimized patches, or whether something has changed in
>> this direction in recent kernel development.
>>
>> Best regards, Davide
>>
>>> Date: Fri, 24 Jul 2015 13:36:55 -0700
>>> From: Ralph Castain <r...@open-mpi.org>
>>> To: Open MPI Users <us...@open-mpi.org>
>>> Subject: Re: [OMPI users] NUMA: Non-local memory access and
>>>   performance effects on OpenMPI
>>>
>>> Hi Bill
>>>
>>> You actually can get OMPI to split a process across sockets. Let's
>>> say there are 4 cores/socket and 2 sockets/node. You could run two
>>> procs on the same node, one split across sockets, by:
>>>
>>>   mpirun -n 1 --map-by core:pe=5 ./app : -n 1 --map-by core:pe=3 ./app
>>>
>>> The first proc will run on all cores of the 1st socket plus the 1st
>>> core of the 2nd socket. The second proc will run on the remaining 3
>>> cores of the 2nd socket.
>>>
>>> HTH
>>> Ralph
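As an aside, mpirun can print the bindings it actually applies, which is
handy for checking a split like the one Ralph describes. A minimal
sketch (the exact output format varies by Open MPI version; ./app is a
placeholder):

    # --report-bindings prints each rank's core binding to stderr,
    # confirming that the first proc really spans both sockets.
    mpirun --report-bindings -n 1 --map-by core:pe=5 ./app \
        : -n 1 --map-by core:pe=3 ./app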
>>>> On Jul 24, 2015, at 12:48 PM, Lane, William <william.l...@cshs.org> wrote:
>>>>
>>>> I'm just curious: if we run an Open MPI job and it makes use of
>>>> non-local memory (i.e. memory tied to another socket), what kind of
>>>> effects are seen on performance?
>>>>
>>>> How would you go about testing the above? I can't think of any
>>>> command-line parameter that would allow one to split an Open MPI
>>>> process across sockets.
>>>>
>>>> I'd imagine it would be pretty bad, since you can't cache non-local
>>>> memory locally, both the request and the data have to flow through
>>>> an IOH, the local CPU would have to compete with the non-local CPU
>>>> for access to its own memory, and doing this would have to be
>>>> implemented with some sort of software semaphore locks (which would
>>>> add even more overhead).
>>>>
>>>> Bill L.
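For reference, the worst-case test discussed in this thread can be
provoked directly from the command line with the numactl tools Davide
mentions. A minimal sketch, assuming a two-socket machine whose sockets
are NUMA nodes 0 and 1; ./membench stands in for any memory-bandwidth
benchmark (e.g. STREAM), and <pid> for a running process ID:

    # Baseline: CPUs and memory both on NUMA node 0.
    numactl --cpunodebind=0 --membind=0 ./membench

    # Worst case: run on node 0 but force all allocations onto node 1,
    # so every memory access crosses the inter-socket link.
    numactl --cpunodebind=0 --membind=1 ./membench

    # Or migrate the pages of an already-running process from node 0
    # to node 1 after it has touched its working set, as Davide
    # suggests ("vandalism"):
    migratepages <pid> 0 1

Comparing the first two runs gives a rough upper bound on the
remote-access penalty Bill asks about.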