users-boun...@open-mpi.org wrote on 18/08/2011 14:41:25:

> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Date: 18/08/2011 14:45
> Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
> Sent by: users-boun...@open-mpi.org
>
> Afraid I am confused. I assume this refers to the trunk, yes?
I work with V1.5.

> I also assume you are talking about launching an application
> directly from srun as opposed to using mpirun - yes?

Yes

> In that case, I fail to understand what difference it makes
> regarding this proposed change. The application process is being
> directly bound by slurm, so what paffinity thinks is irrelevant,
> except perhaps for some debugging I suppose. Is that what you are
> concerned about?

I have a framework that has to check whether the processes are bound.
This framework uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and
requires that all processes be bound. That works well except when I use
srun with slurm configured to bind each single rank with a singleton.
For example, I use nodes with 8 sockets of 4 cores each. The srun
command generates 32 cpusets (one per core) and binds the 32 processes,
one to each cpuset. The macro then returns *bound=false, so my framework
considers the processes unbound and does not do its job correctly. The
patch modifies the macro to return *bound=true when a single process is
bound to a cpuset containing one core.

> I'd just like to know what problem is actually being solved here. I
> agree that, if there is only one processor in a system, you are
> effectively "bound".
>
> On Aug 18, 2011, at 2:25 AM, pascal.dev...@bull.net wrote:
>
> > Hi all,
> >
> > When slurm is configured with the following parameters
> >     TaskPlugin=task/affinity
> >     TaskPluginParam=Cpusets
> > srun binds the processes by placing them into different cpusets,
> > each containing a single core.
> >
> > e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two
> > allocated nodes and place the four ranks there, each single rank
> > with a singleton as a cpu constraint.
> >
> > The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
> > (in opal/mca/paffinity/paffinity.h):
> > . opal_paffinity_base_get_processor_info() fills in num_processors
> >   with 1 (this is the size of each cpu_set)
> > . num_bound is set to 1 too
> > and this implies *bound=false.
> >
> > So, the binding is correctly done by slurm and not detected by MPI.
> >
> > To support the cpuset binding done by slurm, I propose the following
> > patch:
> >
> > hg diff opal/mca/paffinity/paffinity.h
> > diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
> > --- a/opal/mca/paffinity/paffinity.h Thu Apr 21 17:38:00 2011 +0200
> > +++ b/opal/mca/paffinity/paffinity.h Tue Jul 12 15:44:59 2011 +0200
> > @@ -218,7 +218,8 @@
> >                  num_bound++;                                      \
> >              }                                                     \
> >          }                                                         \
> > -        if (0 < num_bound && num_bound < num_processors) {        \
> > +        if (0 < num_bound && ((num_processors == 1) ||            \
> > +                              (num_bound < num_processors))) {    \
> >              *(bound) = true;                                      \
> >          }                                                         \
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users