users-boun...@open-mpi.org wrote on 18/08/2011 14:41:25:

> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Date: 18/08/2011 14:45
> Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
> Sent by: users-boun...@open-mpi.org
>
> Afraid I am confused. I assume this refers to the trunk, yes?
I work with V1.5.

> I also assume you are talking about launching an application
> directly from srun as opposed to using mpirun - yes?

Yes

> In that case, I fail to understand what difference it makes
> regarding this proposed change. The application process is being
> directly bound by slurm, so what paffinity thinks is irrelevant,
> except perhaps for some debugging I suppose. Is that what you are
> concerned about?

I have a framework that has to check whether the processes are bound.
This framework uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and
requires that all processes be bound. That works well except when I use
srun with slurm configured to bind each single rank with a singleton.
For example, I use nodes with 8 sockets of 4 cores each. The srun
command generates 32 cpusets (one per core) and binds the 32 processes,
one to each cpuset. The macro then returns *bound=false, so my framework
considers the processes unbound and does not do its job correctly. The
patch modifies the macro to return *bound=true when a single process is
bound to a cpuset containing one core.

> I'd just like to know what problem is actually being solved here. I
> agree that, if there is only one processor in a system, you are
> effectively "bound".
>
> On Aug 18, 2011, at 2:25 AM, pascal.dev...@bull.net wrote:
>
> > Hi all,
> >
> > When slurm is configured with the following parameters
> >     TaskPlugin=task/affinity
> >     TaskPluginParam=Cpusets
> > srun binds the processes by placing them into different cpusets,
> > each containing a single core.
> >
> > e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two
> > allocated nodes and place the four ranks there, each single rank
> > with a singleton as a cpu constraint.
> >
> > The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
> > (in opal/mca/paffinity/paffinity.h):
> > . opal_paffinity_base_get_processor_info() fills in num_processors
> >   with 1 (this is the size of each cpu_set)
> > . num_bound is set to 1 too
> > and this implies *bound=false.
> >
> > So, the binding is correctly done by slurm and not detected by MPI.
> >
> > To support the cpuset binding done by slurm, I propose the following
> > patch:
> >
> > hg diff opal/mca/paffinity/paffinity.h
> > diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
> > --- a/opal/mca/paffinity/paffinity.h Thu Apr 21 17:38:00 2011 +0200
> > +++ b/opal/mca/paffinity/paffinity.h Tue Jul 12 15:44:59 2011 +0200
> > @@ -218,7 +218,8 @@
> >                  num_bound++;                                      \
> >              }                                                     \
> >          }                                                         \
> > -        if (0 < num_bound && num_bound < num_processors) {        \
> > +        if (0 < num_bound && ((num_processors == 1) ||            \
> > +                              (num_bound < num_processors))) {    \
> >              *(bound) = true;                                      \
> >          }                                                         \
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users