Re: [OMPI users] Bindings not detected with slurm (srun)

2011-08-22 Thread pascal . deveze

users-boun...@open-mpi.org wrote on 18/08/2011 14:41:25:

> From: Ralph Castain 
> To: Open MPI Users 
> Date: 18/08/2011 14:45
> Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
> Sent by: users-boun...@open-mpi.org
>
> Afraid I am confused. I assume this refers to the trunk, yes?

I work with V1.5.

>
> I also assume you are talking about launching an application
> directly from srun as opposed to using mpirun - yes?

Yes

>
> In that case, I fail to understand what difference it makes
> regarding this proposed change. The application process is being
> directly bound by slurm, so what paffinity thinks is irrelevant,
> except perhaps for some debugging I suppose. Is that what you are
> concerned about?

I have a framework that has to check whether the processes are bound. This
framework uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and requires that
all processes be bound.

That works well except when I use srun with slurm configured to bind each
rank to a singleton cpuset.

For example, I use nodes with 8 sockets of 4 cores each. The command srun
creates 32 cpusets (one per core) and binds the 32 processes, one to each
cpuset. The macro then returns *bound=false, so my framework concludes that
the processes are not bound and does not do its job correctly.

The patch modifies the macro so that it returns *bound=true when a process
is bound to a cpuset containing a single core.
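
To make the effect of the change concrete, here is a small standalone C
program (my own sketch, not code taken from Open MPI) that only evaluates
the old and the new condition, using the names described above:

  /* Standalone sketch of the bound test; num_processors is the number of
   * processors reported for the cpuset and num_bound is how many of them
   * the process is actually bound to, as described above. */
  #include <stdbool.h>
  #include <stdio.h>

  static bool is_bound(int num_bound, int num_processors)
  {
      /* Old test: bound only if a strict subset of the processors is used,
       * so a cpuset of one core (1 of 1) is reported as not bound. */
      bool old_test = (0 < num_bound && num_bound < num_processors);

      /* Patched test: a cpuset containing a single core also counts. */
      bool new_test = (0 < num_bound &&
                       ((num_processors == 1) || (num_bound < num_processors)));

      printf("num_bound=%d num_processors=%d old=%d new=%d\n",
             num_bound, num_processors, old_test, new_test);
      return new_test;
  }

  int main(void)
  {
      is_bound(1, 1);    /* slurm cpuset of one core: false before, true after */
      is_bound(4, 32);   /* bound to 4 of 32 cores: true in both versions */
      is_bound(32, 32);  /* not bound (all 32 cores): false in both versions */
      return 0;
  }

The first case is exactly the srun/cpuset situation: the old test reports
not bound, the patched one reports bound.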

>
> I'd just like to know what problem is actually being solved here. I
> agree that, if there is only one processor in a system, you are
> effectively "bound".
>
>
> On Aug 18, 2011, at 2:25 AM, pascal.dev...@bull.net wrote:
>
> > Hi all,
> >
> > When slurm is configured with the following parameters
> >   TaskPlugin=task/affinity
> >   TaskPluginParam=Cpusets
> > srun binds the processes by placing them into different
> > cpusets, each containing a single core.
> >
> > e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two
allocated
> > nodes and place the four ranks there, each single rank with a singleton
as
> > a cpu constraint.
> >
> > The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
> > (in opal/mca/paffinity/paffinity.h):
> >  . opal_paffinity_base_get_processor_info() fills in num_processors
> >    with 1 (this is the size of each cpu_set)
> >  . num_bound is set to 1 too
> > and this implies *bound=false
> >
> > So, the binding is correctly done by slurm and not detected by MPI.
> >
> > To support the cpuset binding done by slurm, I propose the following
> > patch:
> >
> > hg diff opal/mca/paffinity/paffinity.h
> > diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
> > --- a/opal/mca/paffinity/paffinity.h    Thu Apr 21 17:38:00 2011 +0200
> > +++ b/opal/mca/paffinity/paffinity.h    Tue Jul 12 15:44:59 2011 +0200
> > @@ -218,7 +218,8 @@
> >                  num_bound++;                                    \
> >              }                                                   \
> >          }                                                       \
> > -        if (0 < num_bound && num_bound < num_processors) {      \
> > +        if (0 < num_bound && ((num_processors == 1) ||          \
> > +                              (num_bound < num_processors))) {  \
> >              *(bound) = true;                                    \
> >          }                                                       \
> >      }                                                           \
> >
> >






Re: [OMPI users] Bindings not detected with slurm (srun)

2011-08-22 Thread Ralph Castain
Okay - thx! I'll install in trunk and schedule for 1.5


On Aug 22, 2011, at 7:20 AM, pascal.dev...@bull.net wrote:

> 
> users-boun...@open-mpi.org a écrit sur 18/08/2011 14:41:25 :
> 
>> De : Ralph Castain 
>> A : Open MPI Users 
>> Date : 18/08/2011 14:45
>> Objet : Re: [OMPI users] Bindings not detected with slurm (srun)
>> Envoyé par : users-boun...@open-mpi.org
>> 
>> Afraid I am confused. I assume this refers to the trunk, yes?
> 
> I work with V1.5.
> 
>> 
>> I also assume you are talking about launching an application
>> directly from srun as opposed to using mpirun - yes?
> 
> Yes
> 
>> 
>> In that case, I fail to understand what difference it makes
>> regarding this proposed change. The application process is being
>> directly bound by slurm, so what paffinity thinks is irrelevant,
>> except perhaps for some debugging I suppose. Is that what you are
>> concerned about?
> 
> I have a framework that has to check if the processes are bound. This
> framework
> uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and really needs that all
> processes are bound.
> 
> That runs well except when I use srun with slurm configured to bind
> each single rank with a singleton.
> 
> For exemple, I use nodes with 8 sockets of 4 cores. The command srun
> generates 32 cpusets (one for each core) and binds the 32 processes, one
> on each cpuset.
> Then the macro returns *bound=false, and my framework considers that the
> processes are not bound  and doesn't do the job correctly.
> 
> The patch modifies the macro to return *bound=true when a single
> process is bound to a cpuset of one core.
> 
>> 
>> I'd just like to know what problem is actually being solved here. I
>> agree that, if there is only one processor in a system, you are
>> effectively "bound".
>> 
>> 
>> On Aug 18, 2011, at 2:25 AM, pascal.dev...@bull.net wrote:
>> 
>>> Hi all,
>>> 
>>> When slurm is configured with the following parameters
>>>  TaskPlugin=task/affinity
>>>  TaskPluginParam=Cpusets
>>> srun binds the processes by placing them into different
>>> cpusets, each containing a single core.
>>> 
>>> e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two
> allocated
>>> nodes and place the four ranks there, each single rank with a singleton
> as
>>> a cpu constraint.
>>> 
>>> The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
> (in
>>> opal/mca/paffinity/paffinity.h):
>>> . opal_paffinity_base_get_processor_info() fills in num_processors
> with 1
>>> (this is the size of each cpu_set)
>>> . num_bound is set to 1 too
>>> and this implies *bound=false
>>> 
>>> So, the binding is correctly done by slurm and not detected by MPI.
>>> 
>>> To support the cpuset binding done by slurm, I propose the following
> patch:
>>> 
>>> hg diff  opal/mca/paffinity/paffinity.h
>>> diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
>>> --- a/opal/mca/paffinity/paffinity.hThu Apr 21 17:38:00 2011 +0200
>>> +++ b/opal/mca/paffinity/paffinity.hTue Jul 12 15:44:59 2011 +0200
>>> @@ -218,7 +218,8 @@
>>>num_bound++;\
>>>}   \
>>>}   \
>>> -if (0 < num_bound && num_bound < num_processors) {  \
>>> +if (0 < num_bound && ((num_processors == 1) ||  \
>>> +  (num_bound < num_processors))) {  \
>>>*(bound) = true;\
>>>}   \
>>>}   \
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPIIO and EXT3 file systems

2011-08-22 Thread Rob Latham
On Thu, Aug 18, 2011 at 08:46:46AM -0700, Tom Rosmond wrote:
> We have a large fortran application designed to run doing IO with either
> mpi_io or fortran direct access.  On a linux workstation (16 AMD cores)
> running openmpi 1.5.3 and Intel fortran 12.0 we are having trouble with
> random failures with the mpi_io option which do not occur with
> conventional fortran direct access.  We are using ext3 file systems, and
> I have seen some references hinting of similar problems with the
> ext3/mpiio combination.  The application with the mpi_io option runs
> flawlessly on Cray architectures with Lustre file systems, so we are
> also suspicious of the ext3/mpiio combination.  Does anyone else have
> experience with this combination that could shed some light on the
> problem, and hopefully some suggested solutions?

I'm glad to hear you're having success with mpi-io on Cray/Lustre.
That platform was a bit touchy for a while, but has gotten better over
the last two years.

My first guess would be that your linux workstation does not implement
a "strict enough" file system lock.  ROMIO relies on the "fcntl" locks
to provide exclusive access to files at some points in the code.  
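
To give an idea of what is involved, here is a rough hand-written sketch
(not the actual ROMIO code; the file name and helper are made up) of the
style of fcntl byte-range lock ROMIO takes around a read-modify-write step:

  /* Rough sketch of an advisory fcntl byte-range write lock of the kind
   * ROMIO uses; on a file system with weak fcntl semantics this may not
   * actually serialize access. Not ROMIO source code. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  static int lock_range(int fd, off_t offset, off_t len, short type)
  {
      struct flock lock;
      lock.l_type   = type;       /* F_WRLCK to lock, F_UNLCK to release */
      lock.l_whence = SEEK_SET;
      lock.l_start  = offset;     /* byte range being read-modified-written */
      lock.l_len    = len;
      return fcntl(fd, F_SETLKW, &lock);   /* wait until the lock is granted */
  }

  int main(void)
  {
      int fd = open("/tmp/lock_demo.dat", O_CREAT | O_RDWR, 0600);
      if (fd < 0) { perror("open"); return 1; }
      if (lock_range(fd, 0, 4096, F_WRLCK) != 0) perror("lock");
      /* ... read-modify-write of bytes 0..4095 would happen here ... */
      lock_range(fd, 0, 4096, F_UNLCK);
      close(fd);
      return 0;
  }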

Does your application use collective I/O?  It sounds like, if you can
swap fortran and mpi-io so easily, that maybe you do not.  If there is
a way to make collective MPI-IO calls, that will eliminate many of the
fcntl lock calls.
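
For a simple contiguous per-rank write the difference is just which
routine you call; a minimal C sketch, with the file name and sizes as
placeholders:

  /* Minimal sketch: a per-rank contiguous write, shown in its
   * independent and collective forms. The collective form lets ROMIO
   * coordinate the accesses instead of each process locking and
   * writing on its own. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Offset offset;
      double buf[1024];
      int rank, i;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < 1024; i++) buf[i] = rank;

      MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      offset = (MPI_Offset)rank * sizeof(buf);

      /* Independent form (each process acts alone):
       *   MPI_File_write_at(fh, offset, buf, 1024, MPI_DOUBLE,
       *                     MPI_STATUS_IGNORE);
       * Collective form (all ranks that opened the file call together): */
      MPI_File_write_at_all(fh, offset, buf, 1024, MPI_DOUBLE,
                            MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }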

Do you use MPI datatypes to describe either a file view or the
application data?  Such noncontiguous-in-memory and/or
noncontiguous-in-file access patterns will also trigger fcntl lock
calls.  You can use an MPI-IO hint to disable data sieving, at a
potentially disastrous performance cost.
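
If you want to experiment, the hint is passed through an MPI_Info object
at open time. A sketch, assuming the ROMIO hint names romio_ds_read /
romio_ds_write (everything else here is made up):

  /* Sketch: disabling ROMIO data sieving through an MPI_Info object
   * passed at open time. Whether this helps or badly hurts performance
   * depends entirely on the access pattern. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Info info;

      MPI_Init(&argc, &argv);
      MPI_Info_create(&info);
      MPI_Info_set(info, "romio_ds_read",  "disable");  /* no read sieving  */
      MPI_Info_set(info, "romio_ds_write", "disable");  /* no write sieving */

      MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                    MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

      /* ... the application's reads and writes go here ... */

      MPI_File_close(&fh);
      MPI_Info_free(&info);
      MPI_Finalize();
      return 0;
  }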

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Re: [OMPI users] MPIIO and EXT3 file systems

2011-08-22 Thread Tom Rosmond

On Mon, 2011-08-22 at 10:23 -0500, Rob Latham wrote:
> On Thu, Aug 18, 2011 at 08:46:46AM -0700, Tom Rosmond wrote:
> > We have a large fortran application designed to run doing IO with either
> > mpi_io or fortran direct access.  On a linux workstation (16 AMD cores)
> > running openmpi 1.5.3 and Intel fortran 12.0 we are having trouble with
> > random failures with the mpi_io option which do not occur with
> > conventional fortran direct access.  We are using ext3 file systems, and
> > I have seen some references hinting of similar problems with the
> > ext3/mpiio combination.  The application with the mpi_io option runs
> > flawlessly on Cray architectures with Lustre file systems, so we are
> > also suspicious of the ext3/mpiio combination.  Does anyone else have
> > experience with this combination that could shed some light on the
> > problem, and hopefully some suggested solutions?
> 
> I'm glad to hear you're having success with mpi-io on Cray/Lustre.
> That platform was a bit touchy for a while, but has gotten better over
> the last two years.
> 
> My first guess would be that your linux workstation does not implement
> a "strict enough" file system lock.  ROMIO relies on the "fcntl" locks
> to provide exclusive access to files at some points in the code.  
> 
> Does your application use collective I/O ?  It sounds like if you can
> swap fortran and mpi-io so easily that maybe you do not.  If there's
> a way to make collective MPI-IO calls, that will eliminate many of the
> fcntl lock calls.  
> 
Rob

Yes, we are using collective I/O (mpi_file_write_at_all,
mpi_file_read_at_all).  The swapping of fortran and mpi-io is just a
matter of branches in the code at strategic locations.  Although the
mpi-io files are readable with fortran direct access, we don't do that
from within the application because of the different data organization
in the files.

> Do you use MPI datatypes to describe either a file view or the
> application data?   These noncontiguous in memory and/or noncontiguous
> in file access patterns will also trigger fcntl lock calls.  You can
> use an MPI-IO hint to disable data sieving, at a potentially
> disastrous performance cost. 

Yes, we use an 'mpi_type_indexed' datatype to describe the data
organization.  
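
In C terms (our application is fortran and the real block lengths and
displacements are more involved), the access pattern is roughly like the
toy sketch below; the numbers are invented just to show the shape:

  /* Toy sketch (not our application): an indexed datatype used as the
   * file view, followed by a collective write. Displacements are in
   * units of MPI_DOUBLE and are invented; the layout only interleaves
   * cleanly for up to 4 ranks. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Datatype filetype;
      double buf[6];
      int blocklens[3] = { 2, 2, 2 };   /* three blocks of 2 doubles each */
      int displs[3];
      int rank, i;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < 6; i++) buf[i] = (double)rank;
      for (i = 0; i < 3; i++) displs[i] = rank * 2 + i * 8;

      MPI_Type_indexed(3, blocklens, displs, MPI_DOUBLE, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File_open(MPI_COMM_WORLD, "demo_indexed.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

      /* Collective write: the noncontiguous file view goes through
       * ROMIO's collective path instead of per-process data sieving. */
      MPI_File_write_all(fh, buf, 6, MPI_DOUBLE, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Type_free(&filetype);
      MPI_Finalize();
      return 0;
  }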

Any thoughts about the XFS vs EXT3 question?

Thanks for the help

T. Rosmond

