Re: [OMPI users] Bindings not detected with slurm (srun)
users-boun...@open-mpi.org wrote on 18/08/2011 14:41:25:

> From: Ralph Castain
> To: Open MPI Users
> Date: 18/08/2011 14:45
> Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
> Sent by: users-boun...@open-mpi.org
>
> Afraid I am confused. I assume this refers to the trunk, yes?

I work with V1.5.

> I also assume you are talking about launching an application
> directly from srun as opposed to using mpirun - yes?

Yes.

> In that case, I fail to understand what difference it makes
> regarding this proposed change. The application process is being
> directly bound by slurm, so what paffinity thinks is irrelevant,
> except perhaps for some debugging I suppose. Is that what you are
> concerned about?

I have a framework that has to check whether the processes are bound. This
framework uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and really needs
all processes to be bound.

That works well except when I use srun with slurm configured to bind each
rank to a singleton cpuset.

For example, I use nodes with 8 sockets of 4 cores each. The srun command
generates 32 cpusets (one for each core) and binds the 32 processes, one to
each cpuset. The macro then returns *bound=false, so my framework considers
the processes unbound and doesn't do its job correctly.

The patch modifies the macro to return *bound=true when a single process is
bound to a cpuset containing one core.

> I'd just like to know what problem is actually being solved here. I
> agree that, if there is only one processor in a system, you are
> effectively "bound".
>
> On Aug 18, 2011, at 2:25 AM, pascal.dev...@bull.net wrote:
>
> > Hi all,
> >
> > When slurm is configured with the following parameters
> >     TaskPlugin=task/affinity
> >     TaskPluginParam=Cpusets
> > srun binds the processes by placing them into different cpusets,
> > each containing a single core.
> >
> > e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two
> > allocated nodes and place the four ranks there, each single rank
> > with a singleton as a cpu constraint.
> >
> > The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
> > (in opal/mca/paffinity/paffinity.h):
> >   . opal_paffinity_base_get_processor_info() fills in num_processors
> >     with 1 (this is the size of each cpuset)
> >   . num_bound is set to 1 too
> > and this implies *bound=false.
> >
> > So the binding is correctly done by slurm but not detected by MPI.
> >
> > To support the cpuset binding done by slurm, I propose the following
> > patch:
> >
> > hg diff opal/mca/paffinity/paffinity.h
> > diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
> > --- a/opal/mca/paffinity/paffinity.h    Thu Apr 21 17:38:00 2011 +0200
> > +++ b/opal/mca/paffinity/paffinity.h    Tue Jul 12 15:44:59 2011 +0200
> > @@ -218,7 +218,8 @@
> >              num_bound++;                                        \
> >          }                                                       \
> >      }                                                           \
> > -    if (0 < num_bound && num_bound < num_processors) {          \
> > +    if (0 < num_bound && ((num_processors == 1) ||              \
> > +                          (num_bound < num_processors))) {      \
> >          *(bound) = true;                                        \
> >      }                                                           \
> >  }                                                               \
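[Editor's note: a minimal standalone sketch, not Open MPI code, of how the situation Pascal describes can be observed from inside a task. It only assumes a Linux/glibc system; inside a single-core cpuset created by slurm's task/affinity plugin, the process affinity mask is expected to contain exactly one CPU, consistent with num_processors coming back as 1.]

    /* check_mask.c - illustrative only; not part of Open MPI.
     * Prints how many CPUs are in this process's affinity mask.
     * Run under "srun" with TaskPluginParam=Cpusets to see a count of 1.
     * Build: gcc -o check_mask check_mask.c                              */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        /* CPU_COUNT is a glibc macro giving the number of set CPUs */
        printf("CPUs in affinity mask: %d\n", CPU_COUNT(&mask));
        return 0;
    }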
Re: [OMPI users] Bindings not detected with slurm (srun)
Okay - thx! I'll install in trunk and schedule for 1.5

On Aug 22, 2011, at 7:20 AM, pascal.dev...@bull.net wrote:

> users-boun...@open-mpi.org wrote on 18/08/2011 14:41:25:
>
>> From: Ralph Castain
>> To: Open MPI Users
>> Date: 18/08/2011 14:45
>> Subject: Re: [OMPI users] Bindings not detected with slurm (srun)
>> Sent by: users-boun...@open-mpi.org
>>
>> Afraid I am confused. I assume this refers to the trunk, yes?
>
> I work with V1.5.
>
>> I also assume you are talking about launching an application
>> directly from srun as opposed to using mpirun - yes?
>
> Yes.
>
>> In that case, I fail to understand what difference it makes
>> regarding this proposed change. The application process is being
>> directly bound by slurm, so what paffinity thinks is irrelevant,
>> except perhaps for some debugging I suppose. Is that what you are
>> concerned about?
>
> I have a framework that has to check whether the processes are bound.
> This framework uses the macro OPAL_PAFFINITY_PROCESS_IS_BOUND and really
> needs all processes to be bound.
>
> That works well except when I use srun with slurm configured to bind
> each rank to a singleton cpuset.
>
> For example, I use nodes with 8 sockets of 4 cores each. The srun
> command generates 32 cpusets (one for each core) and binds the 32
> processes, one to each cpuset. The macro then returns *bound=false, so
> my framework considers the processes unbound and doesn't do its job
> correctly.
>
> The patch modifies the macro to return *bound=true when a single process
> is bound to a cpuset containing one core.
>
>> I'd just like to know what problem is actually being solved here. I
>> agree that, if there is only one processor in a system, you are
>> effectively "bound".
>>
>> On Aug 18, 2011, at 2:25 AM, pascal.dev...@bull.net wrote:
>>
>>> Hi all,
>>>
>>> When slurm is configured with the following parameters
>>>     TaskPlugin=task/affinity
>>>     TaskPluginParam=Cpusets
>>> srun binds the processes by placing them into different cpusets,
>>> each containing a single core.
>>>
>>> e.g. "srun -N 2 -n 4" will create 2 cpusets in each of the two
>>> allocated nodes and place the four ranks there, each single rank
>>> with a singleton as a cpu constraint.
>>>
>>> The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND
>>> (in opal/mca/paffinity/paffinity.h):
>>>   . opal_paffinity_base_get_processor_info() fills in num_processors
>>>     with 1 (this is the size of each cpuset)
>>>   . num_bound is set to 1 too
>>> and this implies *bound=false.
>>>
>>> So the binding is correctly done by slurm but not detected by MPI.
>>>
>>> To support the cpuset binding done by slurm, I propose the following
>>> patch:
>>>
>>> hg diff opal/mca/paffinity/paffinity.h
>>> diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
>>> --- a/opal/mca/paffinity/paffinity.h    Thu Apr 21 17:38:00 2011 +0200
>>> +++ b/opal/mca/paffinity/paffinity.h    Tue Jul 12 15:44:59 2011 +0200
>>> @@ -218,7 +218,8 @@
>>>              num_bound++;                                        \
>>>          }                                                       \
>>>      }                                                           \
>>> -    if (0 < num_bound && num_bound < num_processors) {          \
>>> +    if (0 < num_bound && ((num_processors == 1) ||              \
>>> +                          (num_bound < num_processors))) {      \
>>>          *(bound) = true;                                        \
>>>      }                                                           \
>>>  }                                                               \
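[Editor's note: to make the effect of the accepted patch concrete, here is a small standalone sketch. It is simplified from the real macro (the helper names and the main() driver are invented for illustration); it only reproduces the boolean test that the patch changes, showing why a process filling its single-core cpuset was previously reported as unbound.]

    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified stand-in for the test inside OPAL_PAFFINITY_PROCESS_IS_BOUND.
     * num_processors: processors visible to the process (the cpuset size)
     * num_bound:      processors the process is actually bound to          */
    static bool is_bound_old(int num_bound, int num_processors)
    {
        return 0 < num_bound && num_bound < num_processors;
    }

    static bool is_bound_patched(int num_bound, int num_processors)
    {
        return 0 < num_bound &&
               ((num_processors == 1) || (num_bound < num_processors));
    }

    int main(void)
    {
        /* slurm cpuset case: one core visible, process bound to it */
        printf("old:     %d\n", is_bound_old(1, 1));     /* 0 -> reported unbound */
        printf("patched: %d\n", is_bound_patched(1, 1)); /* 1 -> reported bound   */
        return 0;
    }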
Re: [OMPI users] MPIIO and EXT3 file systems
On Thu, Aug 18, 2011 at 08:46:46AM -0700, Tom Rosmond wrote:
> We have a large fortran application designed to run doing IO with either
> mpi_io or fortran direct access. On a linux workstation (16 AMD cores)
> running openmpi 1.5.3 and Intel fortran 12.0 we are having trouble with
> random failures with the mpi_io option which do not occur with
> conventional fortran direct access. We are using ext3 file systems, and
> I have seen some references hinting of similar problems with the
> ext3/mpiio combination. The application with the mpi_io option runs
> flawlessly on Cray architectures with Lustre file systems, so we are
> also suspicious of the ext3/mpiio combination. Does anyone else have
> experience with this combination that could shed some light on the
> problem, and hopefully some suggested solutions?

I'm glad to hear you're having success with mpi-io on Cray/Lustre.
That platform was a bit touchy for a while, but has gotten better over
the last two years.

My first guess would be that your linux workstation does not implement
a "strict enough" file system lock. ROMIO relies on the "fcntl" locks
to provide exclusive access to files at some points in the code.

Does your application use collective I/O? It sounds like if you can
swap fortran and mpi-io so easily that maybe you do not. If there's
a way to make collective MPI-IO calls, that will eliminate many of the
fcntl lock calls.

Do you use MPI datatypes to describe either a file view or the
application data? These noncontiguous in memory and/or noncontiguous
in file access patterns will also trigger fcntl lock calls. You can
use an MPI-IO hint to disable data sieving, at a potentially
disastrous performance cost.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
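[Editor's note: a minimal C sketch of the hint Rob mentions: disabling ROMIO's data sieving through an MPI_Info object passed at open time. The hint names romio_ds_read/romio_ds_write are ROMIO-specific (other MPI-IO implementations silently ignore them); the file name and access mode below are placeholders.]

    #include <mpi.h>

    /* Sketch: open a file with ROMIO's data-sieving hints disabled. */
    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_ds_read",  "disable");  /* no read data sieving  */
        MPI_Info_set(info, "romio_ds_write", "disable");  /* no write data sieving */

        /* "testfile" is a placeholder path */
        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

        /* ... reads/writes go here ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }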
Re: [OMPI users] MPIIO and EXT3 file systems
On Mon, 2011-08-22 at 10:23 -0500, Rob Latham wrote:
> On Thu, Aug 18, 2011 at 08:46:46AM -0700, Tom Rosmond wrote:
> > We have a large fortran application designed to run doing IO with
> > either mpi_io or fortran direct access. On a linux workstation (16 AMD
> > cores) running openmpi 1.5.3 and Intel fortran 12.0 we are having
> > trouble with random failures with the mpi_io option which do not occur
> > with conventional fortran direct access. We are using ext3 file
> > systems, and I have seen some references hinting of similar problems
> > with the ext3/mpiio combination. The application with the mpi_io
> > option runs flawlessly on Cray architectures with Lustre file systems,
> > so we are also suspicious of the ext3/mpiio combination. Does anyone
> > else have experience with this combination that could shed some light
> > on the problem, and hopefully some suggested solutions?
>
> I'm glad to hear you're having success with mpi-io on Cray/Lustre.
> That platform was a bit touchy for a while, but has gotten better over
> the last two years.
>
> My first guess would be that your linux workstation does not implement
> a "strict enough" file system lock. ROMIO relies on the "fcntl" locks
> to provide exclusive access to files at some points in the code.
>
> Does your application use collective I/O? It sounds like if you can
> swap fortran and mpi-io so easily that maybe you do not. If there's
> a way to make collective MPI-IO calls, that will eliminate many of the
> fcntl lock calls.

Rob,

Yes, we are using collective I/O (mpi_file_write_at_all,
mpi_file_read_at_all). Swapping between fortran and mpi-io is just done
with branches in the code at strategic locations. Although the mpi-io
files are readable with fortran direct access, we don't do that from
within the application because the data is organized differently in the
two kinds of files.

> Do you use MPI datatypes to describe either a file view or the
> application data? These noncontiguous in memory and/or noncontiguous
> in file access patterns will also trigger fcntl lock calls. You can
> use an MPI-IO hint to disable data sieving, at a potentially
> disastrous performance cost.

Yes, we use an 'mpi_type_indexed' datatype to describe the data
organization.

Any thoughts about the XFS vs EXT3 question?

Thanks for the help

T. Rosmond

> ==rob
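[Editor's note: for reference, a C sketch of the combination being discussed: a collective write (mpi_file_write_at_all in the Fortran code maps onto MPI_File_write_at_all) through an indexed file view built with MPI_Type_indexed. The block lengths, displacements, and file name are made up for illustration; this is not the poster's application code.]

    #include <mpi.h>

    /* Sketch: each rank writes 2 blocks of 4 ints through an indexed file
     * view, then all ranks take part in one collective write. Displacements
     * are chosen so that up to 4 ranks interleave without overlapping.     */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* noncontiguous-in-file pattern: 2 blocks of 4 ints, 16 ints apart */
        int blocklens[2] = { 4, 4 };
        int displs[2]    = { rank * 4, 16 + rank * 4 };  /* in units of MPI_INT */
        MPI_Datatype filetype;
        MPI_Type_indexed(2, blocklens, displs, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "testfile",        /* placeholder path */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

        int buf[8];
        for (int i = 0; i < 8; i++) buf[i] = rank;

        /* collective write: every rank in the communicator calls this */
        MPI_File_write_at_all(fh, 0, buf, 8, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        MPI_Finalize();
        return 0;
    }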