Re: [OMPI users] Forcing MPI processes to end

2017-11-16 Thread Aurelien Bouteiller
Adam, your MPI program is incorrect. You need to replace the MPI_Finalize() call on
the process that found the error with MPI_Abort().

On Nov 16, 2017 10:38, "Adam Sylvester"  wrote:

> I'm using Open MPI 2.1.0 for this but I'm not sure if this is more of an
> Open MPI-specific implementation question or what the MPI standard
> guarantees.
>
> I have an application which runs across multiple ranks, eventually
> reaching an MPI_Gather() call.  Along the way, if one of the ranks
> encounters an error, it will report the error to a log, call
> MPI_Finalize(), and exit with a non-zero return code.  If this happens
> prior to the other ranks making it to the gather, it seems like mpirun
> notices this and the process ends on all ranks.  This is what I want to
> happen - it's a legitimate error, so all processes should be freed up so
> the next job can run.  It seems like if the other ranks make it into the
> MPI_Gather() before the one rank reports an error, the other ranks wait in
> the MPI_Gather() forever.
>
> Is there something simple I can do to guarantee that if any process calls
> MPI_Finalize(), all my ranks terminate?
>
> Thanks.
> -Adam
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>

Re: [OMPI users] OMPI 2.1.2 and SLURM compatibility

2017-11-16 Thread r...@open-mpi.org
What Charles said was true but not quite complete. We still support the older
PMI libraries, but you likely have to point us to wherever Slurm put them.

However, we definitely recommend using PMIx, as you will get a faster launch.
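The steps Charles outlines below can be sketched as a shell recipe (the install prefix and source-tree locations are assumptions; adjust for your site):

```shell
# Sketch only: build PMIx, rebuild Slurm against it, launch with the
# pmix plugin. PMIX_PREFIX is a placeholder for your install path.
PMIX_PREFIX=/opt/pmix/1.1.5

# 1. PMIx 1.1.5 (tarballs: https://github.com/pmix/tarballs), from its source tree
./configure --prefix="$PMIX_PREFIX" && make && make install

# 2. Slurm 16.05 or newer, built with PMIx support, from its source tree
./configure --with-pmix="$PMIX_PREFIX" && make && make install

# 3. Launch MPI tasks on the allocated resources through the pmix plugin
srun --mpi=pmix ./hello-mpi
```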

Sent from my iPad

> On Nov 16, 2017, at 9:11 AM, Bennet Fauber  wrote:
> 
> Charlie,
> 
> Thanks a ton!  Yes, we are missing two of the three steps.
> 
> Will report back after we get pmix installed and after we rebuild
> Slurm.  We do have a new enough version of it, at least, so we might
> have missed the target, but we did at least hit the barn.  ;-)
> 
> 
> 
>> On Thu, Nov 16, 2017 at 10:54 AM, Charles A Taylor  wrote:
>> Hi Bennet,
>> 
>> Three things...
>> 
>> 1. OpenMPI 2.x requires PMIx in lieu of pmi1/pmi2.
>> 
>> 2. You will need slurm 16.05 or greater built with --with-pmix
>> 
>> 2a. You will need pmix 1.1.5 which you can get from github.
>> (https://github.com/pmix/tarballs).
>> 
>> 3. then, to launch your mpi tasks on the allocated resources,
>> 
>>   srun --mpi=pmix ./hello-mpi
>> 
>> I’m replying to the list because,
>> 
>> a) this information is harder to find than you might think.
>> b) someone/anyone can correct me if I’m giving a bum steer.
>> 
>> Hope this helps,
>> 
>> Charlie Taylor
>> University of Florida
>> 
>> On Nov 16, 2017, at 10:34 AM, Bennet Fauber wrote:
>> 
>> [Bennet's original post quoted in full; trimmed. The complete message
>> appears below as its own entry.]

Re: [OMPI users] OMPI 2.1.2 and SLURM compatibility

2017-11-16 Thread Bennet Fauber
Charlie,

Thanks a ton!  Yes, we are missing two of the three steps.

Will report back after we get pmix installed and after we rebuild
Slurm.  We do have a new enough version of it, at least, so we might
have missed the target, but we did at least hit the barn.  ;-)



On Thu, Nov 16, 2017 at 10:54 AM, Charles A Taylor  wrote:
> Hi Bennet,
>
> Three things...
>
> 1. OpenMPI 2.x requires PMIx in lieu of pmi1/pmi2.
>
> 2. You will need slurm 16.05 or greater built with --with-pmix
>
> 2a. You will need pmix 1.1.5 which you can get from github.
> (https://github.com/pmix/tarballs).
>
> 3. then, to launch your mpi tasks on the allocated resources,
>
>    srun --mpi=pmix ./hello-mpi
>
> I’m replying to the list because,
>
> a) this information is harder to find than you might think.
> b) someone/anyone can correct me if I’m giving a bum steer.
>
> Hope this helps,
>
> Charlie Taylor
> University of Florida
>
> On Nov 16, 2017, at 10:34 AM, Bennet Fauber wrote:
>
> [Bennet's original post quoted in full; trimmed. The complete message
> appears below as its own entry.]

[OMPI users] OMPI 2.1.2 and SLURM compatibility

2017-11-16 Thread Bennet Fauber
I think that OpenMPI is supposed to support SLURM integration such that

srun ./hello-mpi

should work?  I built OMPI 2.1.2 with

export CONFIGURE_FLAGS='--disable-dlopen --enable-shared'
export COMPILERS='CC=gcc CXX=g++ FC=gfortran F77=gfortran'

CMD="./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-slurm \
--with-pmi \
--with-lustre \
--with-verbs \
$CONFIGURE_FLAGS \
$COMPILERS

I have a simple hello-mpi.c (source included below), which compiles
and runs with mpirun, both on the login node and in a job.  However,
when I try to use srun in place of mpirun, I get instead a hung job,
which upon cancellation produces this output.

[bn2.stage.arc-ts.umich.edu:116377] PMI_Init [pmix_s1.c:162:s1_init]:
PMI is not initialized
[bn1.stage.arc-ts.umich.edu:36866] PMI_Init [pmix_s1.c:162:s1_init]:
PMI is not initialized
[warn] opal_libevent2022_event_active: event has no event_base set.
[warn] opal_libevent2022_event_active: event has no event_base set.
slurmstepd: error: *** STEP 86.0 ON bn1 CANCELLED AT 2017-11-16T10:03:24 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 86 ON bn1 CANCELLED AT 2017-11-16T10:03:24 ***

The SLURM web page suggests that OMPI 2.x and later support PMIx, and
to use `srun --mpi=pmix`, however that no longer seems to be an
option, and using the `openmpi` type isn't working (neither is pmi2).

[bennet@beta-build hello]$ srun --mpi=list
srun: MPI types are...
srun: mpi/pmi2
srun: mpi/lam
srun: mpi/openmpi
srun: mpi/mpich1_shmem
srun: mpi/none
srun: mpi/mvapich
srun: mpi/mpich1_p4
srun: mpi/mpichgm
srun: mpi/mpichmx

To get the Intel PMI to work with srun, I have to set

I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

Is there a comparable environment variable that must be set to enable
`srun` to work?

Am I missing a build option or misspecifying one?

-- bennet


Source of hello-mpi.c
==
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv){

  int rank;      /* rank of process */
  int numprocs;  /* size of COMM_WORLD */
  int namelen;
  int tag=10;    /* expected tag */
  int message;   /* Recv'd message */
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  MPI_Status status; /* status of recv */

  /* call Init, size, and rank */
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

  if(rank != 0){
    MPI_Recv(&message,        /* buffer for message */
             1,               /* MAX count to recv */
             MPI_INT,         /* type to recv */
             0,               /* recv from 0 only */
             tag,             /* tag of message */
             MPI_COMM_WORLD,  /* communicator to use */
             &status);        /* status object */
    printf("Hello from process %d!\n", rank);
  }
  else{
    /* rank 0 ONLY executes this */
    printf("MPI_COMM_WORLD is %d processes big!\n", numprocs);
    int x;
    for(x=1; x<numprocs; x++){
      MPI_Send(&x,              /* send x to process x */
               1,               /* number to send */
               MPI_INT,         /* type to send */
               x,               /* rank to send to */
               tag,             /* tag for message */
               MPI_COMM_WORLD); /* communicator to use */
    }
  } /* end else */

  /* always call at end */
  MPI_Finalize();

  return 0;
}

Re: [OMPI users] --map-by

2017-11-16 Thread Noam Bernstein

> On Nov 16, 2017, at 9:49 AM, r...@open-mpi.org wrote:
> 
> Do not include the “bind-to core” option. The mapping directive already forces that.

Same error message, unfortunately. And no, I’m not setting a global binding 
policy, as far as I can tell:

env | grep OMPI_MCA
OMPI_MCA_hwloc_base_report_bindings=1
[compute-7-6:15083] SETTING BINDING TO CORE
--
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.
--

Noam



||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 

Re: [OMPI users] --map-by

2017-11-16 Thread r...@open-mpi.org
Do not include the “bind-to core” option. The mapping directive already forces that.
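In other words, something like the following (a sketch only; the thread count and the program name are placeholders, not Noam's actual command):

```shell
# Keep the :PE=4 modifier, which maps and binds 4 cpus to each rank,
# and drop the explicit --bind-to core that triggered the error.
export OMP_NUM_THREADS=4
mpirun -x OMP_NUM_THREADS --map-by core:PE=4 -np 32 python script.py
```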

Sent from my iPad

> On Nov 16, 2017, at 7:44 AM, Noam Bernstein wrote:
> 
> [Noam's original post quoted in full; trimmed. The complete message
> appears below as its own entry.]

[OMPI users] --map-by

2017-11-16 Thread Noam Bernstein
Hi all - I’m trying to run mixed MPI/OpenMP, so I ideally want binding of each 
MPI process to a small set of cores (to allow for the OpenMP threads).   From 
the mpirun docs at 
https://www.open-mpi.org//doc/current/man1/mpirun.1.php 

I got the example that I thought corresponded to what I want,
% mpirun ... --map-by core:PE=2 --bind-to core
So I tried
mpirun -x OMP_NUM_THREADS --map-by core:PE=4 --bind-to core -np 32   python …..

However, when I run this (with openmpi 3.0.0 or with 1.8.8) I get the following 
error:
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.

Am I doing something wrong, or is there a mistake in the docs, and it should 
bind to something other than core?


thanks,

Noam



Re: [OMPI users] Invalid results with OpenMPI on Ubuntu Artful because of --enable-heterogeneous

2017-11-16 Thread Xavier Besseron
Thanks for looking at it!

Apparently, someone requested support for heterogeneous machines a long time
ago:
https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/419074


Xavier



On Mon, Nov 13, 2017 at 7:56 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Xavier,
>
> i confirm there is a bug when using MPI_ANY_SOURCE with Open MPI
> configure'd with --enable-heterogeneous
>
> i made https://github.com/open-mpi/ompi/pull/4501 in order to fix
> that, and will merge and backport once reviewed
>
>
> Cheers,
>
> Gilles
>
> On Mon, Nov 13, 2017 at 8:46 AM, Gilles Gouaillardet
>  wrote:
> > Xavier,
> >
> > thanks for the report, i will have a look at it.
> >
> > is the bug triggered by MPI_ANY_SOURCE ?
> > /* e.g. does it work if you MPI_Irecv(..., myrank, ...) ? */
> >
> >
> > Unless Ubuntu wants out-of-the-box support between heterogeneous nodes
> > (for example x86_64 and ppc64),
> > there is little to no point in configuring Open MPI with the
> > --enable-heterogeneous option.
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Mon, Nov 13, 2017 at 7:56 AM, Xavier Besseron 
> wrote:
> >> Dear all,
> >>
> >> I want to share with you the following issue with the OpenMPI shipped with
> >> the latest Ubuntu Artful. It is OpenMPI 2.1.1 compiled with the option
> >> --enable-heterogeneous.
> >>
> >> Looking at this issue https://github.com/open-mpi/ompi/issues/171, it
> >> appears that this option is broken and should not be used.
> >> This option is being used in Debian/Ubuntu since 2010
> >> (http://changelogs.ubuntu.com/changelogs/pool/universe/o/openmpi/openmpi_2.1.1-6/changelog)
> >> and is still used so far. Apparently, nobody complained so far.
> >>
> >> However, now I complain :-)
> >> I've found a simple example for which this option causes invalid results
> >> in OpenMPI.
> >>
> >>
> >> int A = 666, B = 42;
> >> MPI_Irecv(&B, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &request);
> >> MPI_Send(&A, 1, MPI_INT, my_rank, tag, comm);
> >> MPI_Wait(&request, &status);
> >>
> >> # After that, when compiled with --enable-heterogeneous, we have A != B
> >>
> >> This happens with just a single process. The full example is in attachment
> >> (to be run with "mpirun -n 1 ./bug_openmpi_artful").
> >> I extracted and simplified the code from the Zoltan library with which I
> >> initially noticed the issue.
> >>
> >> I find it annoying that Ubuntu distributes a broken OpenMPI.
> >> I've also tested OpenMPI 2.1.1, 2.1.2 and 3.0.0 and using
> >> --enable-heterogeneous causes the bug systematically.
> >>
> >>
> >> Finally, my points/questions are:
> >>
> >> - To share with you this small example in case you want to debug it
> >>
> >> - What is the status of issue https://github.com/open-mpi/ompi/issues/171?
> >> Is this option still considered broken?
> >> If yes, I encourage you to remove it or mark as deprecated to avoid this
> >> kind of mistake in the future.
> >>
> >> - To get the feedback of OpenMPI developers on the use of this option,
> >> which might convince the Debian/Ubuntu maintainer to remove this flag.
> >> I have opened a bug on Ubuntu for it
> >> https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/1731938
> >>
> >>
> >> Thanks!
> >>
> >> Xavier
> >>
> >>
> >> --
> >> Dr Xavier BESSERON
> >> Research associate
> >> FSTC, University of Luxembourg
> >> Campus Belval, Office MNO E04 0415-040
> >> Phone: +352 46 66 44 5418
> >> http://luxdem.uni.lu/
> >>
> >>