Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread r...@open-mpi.org
Hmmm...well, the error has changed from your initial report. Turning off the 
firewall was the solution to that problem.

This problem is different - it isn’t the orted that failed in the log you sent, 
but the application proc that couldn’t initialize. It looks like that app was 
compiled against some earlier version of OMPI? It is looking for something that 
no longer exists. I saw that you compiled it with a simple “gcc” instead of our 
wrapper compiler “mpicc” - any particular reason? My guess is that your compile 
picked up some older version of OMPI on the system.
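
For reference, a quick check of what the binary actually links against
(assuming the 3.1.0-b wrapper is first in your PATH) would be something like:

  $ mpicc -o test_mpi test_mpi.c -lm
  $ ldd ./test_mpi | grep libmpi

The ldd output should point into the 3.1.0-b install tree rather than any
older OMPI libraries.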

Ralph


> On Jun 17, 2018, at 2:51 PM, Bennet Fauber  wrote:
> 
> I rebuilt with --enable-debug, then ran with
> 
> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> salloc: Pending job allocation 158
> salloc: job 158 queued and waiting for resources
> salloc: job 158 has been allocated resources
> salloc: Granted job allocation 158
> 
> [bennet@cavium-hpc ~]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  5.426759
> The sum = 0.866386
> Elapsed time is:  5.424068
> The sum = 0.866386
> Elapsed time is:  5.426195
> The sum = 0.866386
> Elapsed time is:  5.426059
> The sum = 0.866386
> Elapsed time is:  5.423192
> The sum = 0.866386
> Elapsed time is:  5.426252
> The sum = 0.866386
> Elapsed time is:  5.425444
> The sum = 0.866386
> Elapsed time is:  5.423647
> The sum = 0.866386
> Elapsed time is:  5.426082
> The sum = 0.866386
> Elapsed time is:  5.425936
> The sum = 0.866386
> Elapsed time is:  5.423964
> Total time is:  59.677830
> 
> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> 2>&1 | tee debug2.log
> 
> The zipped debug log should be attached.
> 
> I did that after using systemctl to turn off the firewall on the login
> node from which the mpirun is executed, as well as on the host on
> which it runs.
> 
> [bennet@cavium-hpc ~]$ mpirun hostname
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
> 
> [bennet@cavium-hpc ~]$ squeue
> JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
>   158  standard bash   bennet  R  14:30  1 cav01
> [bennet@cavium-hpc ~]$ srun hostname
> cav01.arc-ts.umich.edu
> [ repeated 23 more times ]
> 
> As always, your help is much appreciated,
> 
> -- bennet
> 
> On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org  wrote:
>> 
>> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
>> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
>> daemon isn’t starting - this will give you some info as to why.
>> 
>> 
>>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber  wrote:
>>> 
>>> I have a compiled binary that will run with srun but not with mpirun.
>>> The attempts to run with mpirun all result in failures to initialize.
>>> I have tried this on one node, and on two nodes, with firewall turned
>>> on and with it off.
>>> 
>>> Am I missing some command line option for mpirun?
>>> 
>>> OMPI built from this configure command
>>> 
>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
>>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
>>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
>>> FC=gfortran
>>> 
>>> All tests from `make check` passed, see below.
>>> 
>>> [bennet@cavium-hpc ~]$ mpicc --show
>>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
>>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
>>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
>>> -Wl,--enable-new-dtags
>>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
>>> 
>>> The test_mpi was compiled with
>>> 
>>> $ gcc -o test_mpi test_mpi.c -lm
>>> 
>>> This is the runtime library path
>>> 
>>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
>>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
>>> 
>>> 
>>> These commands are given in exact sequence in which they were entered
>>> at a console.
>>> 
>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
>>> salloc: Pending job allocation 156
>>> salloc: job 156 queued and waiting for resources
>>> salloc: job 156 has been allocated resources
>>> salloc: Granted job allocation 156
>>> 
>>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
>>> -

Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread Bennet Fauber
To eliminate possibilities, I removed all other versions of OpenMPI
from the system, and rebuilt using the same build script as was used
to generate the prior report.

[bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
Checking compilers and things
OMPI is ompi
COMP_NAME is gcc_7_1_0
SRC_ROOT is /sw/arcts/centos7/src
PREFIX_ROOT is /sw/arcts/centos7
PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
CONFIGURE_FLAGS are
COMPILERS are CC=gcc CXX=g++ FC=gfortran

Currently Loaded Modules:
  1) gcc/7.1.0

 gcc (ARM-build-14) 7.1.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the following configure command

./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm --disable-dlopen
--enable-debug  CC=gcc CXX=g++ FC=gfortran

The tar ball is

2e783873f6b206aa71f745762fa15da5
/sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz

I still get

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 165
salloc: job 165 queued and waiting for resources
salloc: job 165 has been allocated resources
salloc: Granted job allocation 165
[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  5.425549
The sum = 0.866386
Elapsed time is:  5.422826
The sum = 0.866386
Elapsed time is:  5.427676
The sum = 0.866386
Elapsed time is:  5.424928
The sum = 0.866386
Elapsed time is:  5.422060
The sum = 0.866386
Elapsed time is:  5.425431
The sum = 0.866386
Elapsed time is:  5.424350
The sum = 0.866386
Elapsed time is:  5.423037
The sum = 0.866386
Elapsed time is:  5.427727
The sum = 0.866386
Elapsed time is:  5.424922
The sum = 0.866386
Elapsed time is:  5.424279
Total time is:  59.672992

[bennet@cavium-hpc ~]$ mpirun ./test_mpi
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

I reran with

[bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
2>&1 | tee debug3.log

and the gzipped log is attached.

I tried it with a different test program, which produces the error
[cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
[cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  store DAEMON URI failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS


I am almost certain that OMPI mpirun did work at one point, and I am
at a loss to explain why it no longer does.

I have also tried the 3.1.1rc1 version.  I am now going to try 3.0.0,
and we'll try downgrading SLURM to a prior version.

-- bennet


-- bennet
On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org  wrote:
>
> Hmmm...well, the error has changed from your initial report. Turning off the 
> firewall was the solution to that problem.
>
> This problem is different - it isn’t the orted that failed in the log you 
> sent, but the application proc that couldn’t initialize. It looks like that 
> app was compiled against some earlier version of OMPI? It is looking for 
> something that no longer exists. I saw that you compiled it with a simple 
> “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My 
> guess is that your compile picked up some older version of OMPI on the system.
>
> Ralph
>
>
> > On Jun 17, 2018, at 2:51 PM, Bennet Fauber  wrote:
> >
> > I rebuilt with --enable-debug, then ran with
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 158
> > salloc: job 158 queued and waiting for resources
> > salloc: job 158 has been allocated resources
> > salloc: Granted job allocation 158
> >
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is:  5.426759
> > The sum = 0.866386
> > Elapsed time is:  5.424068
> > The sum = 0.866386
> > Elapsed ti

Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread r...@open-mpi.org
I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to your 
mpirun cmd line and see if that works.
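
For example, something like:

  $ mpirun --mca plm rsh --mca plm_base_verbose 10 ./test_mpi

(plm_base_verbose is optional here; it just keeps the same debug output as
before.)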


> On Jun 18, 2018, at 12:57 PM, Bennet Fauber  wrote:
> 
> To eliminate possibilities, I removed all other versions of OpenMPI
> from the system, and rebuilt using the same build script as was used
> to generate the prior report.
> 
> [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> Checking compilers and things
> OMPI is ompi
> COMP_NAME is gcc_7_1_0
> SRC_ROOT is /sw/arcts/centos7/src
> PREFIX_ROOT is /sw/arcts/centos7
> PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> CONFIGURE_FLAGS are
> COMPILERS are CC=gcc CXX=g++ FC=gfortran
> 
> Currently Loaded Modules:
>  1) gcc/7.1.0
> 
> gcc (ARM-build-14) 7.1.0
> Copyright (C) 2017 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> Using the following configure command
> 
> ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
>   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> --with-hwloc=external --with-slurm --disable-dlopen
> --enable-debug  CC=gcc CXX=g++ FC=gfortran
> 
> The tar ball is
> 
> 2e783873f6b206aa71f745762fa15da5
> /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
> 
> I still get
> 
> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> salloc: Pending job allocation 165
> salloc: job 165 queued and waiting for resources
> salloc: job 165 has been allocated resources
> salloc: Granted job allocation 165
> [bennet@cavium-hpc ~]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  5.425549
> The sum = 0.866386
> Elapsed time is:  5.422826
> The sum = 0.866386
> Elapsed time is:  5.427676
> The sum = 0.866386
> Elapsed time is:  5.424928
> The sum = 0.866386
> Elapsed time is:  5.422060
> The sum = 0.866386
> Elapsed time is:  5.425431
> The sum = 0.866386
> Elapsed time is:  5.424350
> The sum = 0.866386
> Elapsed time is:  5.423037
> The sum = 0.866386
> Elapsed time is:  5.427727
> The sum = 0.866386
> Elapsed time is:  5.424922
> The sum = 0.866386
> Elapsed time is:  5.424279
> Total time is:  59.672992
> 
> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
> 
> I reran with
> 
> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> 2>&1 | tee debug3.log
> 
> and the gzipped log is attached.
> 
> I thought to try it with a different test program, which spits the error
> [cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
> found in file base/ess_base_std_app.c at line 219
> [cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
> found in file base/ess_base_std_app.c at line 219
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  store DAEMON URI failed
>  --> Returned value Not found (-13) instead of ORTE_SUCCESS
> 
> 
> At one point, I am almost certain that OMPI mpirun did work, and I am
> at a loss to explain why it no longer does.
> 
> I have also tried the 3.1.1rc1 version.  I am now going to try 3.0.0,
> and we'll try downgrading SLURM to a prior version.
> 
> -- bennet
> 
> 
> -- bennet
> On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org  wrote:
>> 
>> Hmmm...well, the error has changed from your initial report. Turning off the 
>> firewall was the solution to that problem.
>> 
>> This problem is different - it isn’t the orted that failed in the log you 
>> sent, but the application proc that couldn’t initialize. It looks like that 
>> app was compiled against some earlier version of OMPI? It is looking for 
>> something that no longer exists. I saw that you compiled it with a simple 
>> “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My 
>> guess is that your compile picked up some older version of OMPI on the 
>> system.
>> 
>> Ralph
>> 
>> 
>>> On Jun 17, 2018, at 2:51 PM, Bennet Fauber  wrote:
>>> 
>>> I rebuilt with --enable-debug, then ran with
>>> 
>>> [benne

Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread Bennet Fauber
No such luck.  If it matters, mpirun does seem to work with processes
on the local node that have no internal MPI code.  That is,

[bennet@cavium-hpc ~]$ mpirun -np 4 hello
Hello, ARM
Hello, ARM
Hello, ARM
Hello, ARM

but it fails with a similar error if run while a SLURM job is active; i.e.,

[bennet@cavium-hpc ~]$ mpirun hello
--
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[26589,0],0] on node cavium-hpc
  Remote daemon: [[26589,0],1] on node cav01

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--

That makes sense, I guess.

I'll keep you posted as to what happens with 3.0.0 and downgrading SLURM.


Thanks,   -- bennet


On Mon, Jun 18, 2018 at 4:05 PM r...@open-mpi.org  wrote:
>
> I doubt Slurm is the issue. For grins, lets try adding “--mca plm rsh” to 
> your mpirun cmd line and see if that works.
>
>
> > On Jun 18, 2018, at 12:57 PM, Bennet Fauber  wrote:
> >
> > To eliminate possibilities, I removed all other versions of OpenMPI
> > from the system, and rebuilt using the same build script as was used
> > to generate the prior report.
> >
> > [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> > Checking compilers and things
> > OMPI is ompi
> > COMP_NAME is gcc_7_1_0
> > SRC_ROOT is /sw/arcts/centos7/src
> > PREFIX_ROOT is /sw/arcts/centos7
> > PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > CONFIGURE_FLAGS are
> > COMPILERS are CC=gcc CXX=g++ FC=gfortran
> >
> > Currently Loaded Modules:
> >  1) gcc/7.1.0
> >
> > gcc (ARM-build-14) 7.1.0
> > Copyright (C) 2017 Free Software Foundation, Inc.
> > This is free software; see the source for copying conditions.  There is NO
> > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> >
> > Using the following configure command
> >
> > ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> >   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
> > --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> > --with-hwloc=external --with-slurm --disable-dlopen
> > --enable-debug  CC=gcc CXX=g++ FC=gfortran
> >
> > The tar ball is
> >
> > 2e783873f6b206aa71f745762fa15da5
> > /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
> >
> > I still get
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 165
> > salloc: job 165 queued and waiting for resources
> > salloc: job 165 has been allocated resources
> > salloc: Granted job allocation 165
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is:  5.425549
> > The sum = 0.866386
> > Elapsed time is:  5.422826
> > The sum = 0.866386
> > Elapsed time is:  5.427676
> > The sum = 0.866386
> > Elapsed time is:  5.424928
> > The sum = 0.866386
> > Elapsed time is:  5.422060
> > The sum = 0.866386
> > Elapsed time is:  5.425431
> > The sum = 0.866386
> > Elapsed time is:  5.424350
> > The sum = 0.866386
> > Elapsed time is:  5.423037
> > The sum = 0.866386
> > Elapsed time is:  5.427727
> > The sum = 0.866386
> > Elapsed time is:  5.424922
> > The sum = 0.866386
> > Elapsed time is:  5.424279
> > Total time is:  59.672992
> >
> > [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > --
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --
> >
> > I reran with
> >
> > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> > 2>&1 | tee debug3.log
> >
> > and the gzipped log is attached.
> >
> > I thought to try it with a different test program, which spits the error
> > [cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
> > found in file base/ess_base_std_app.c at line 219
> > [cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
> > found in file base/ess_base_std_app.c at line 219
> > --
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems.  This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Op

Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread Bennet Fauber
If it's of any use, 3.0.0 seems to hang at

Making check in class
make[2]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make  ompi_rb_tree opal_bitmap opal_hash_table opal_proc_table
opal_tree opal_list opal_value_array opal_pointer_array opal_lifo
opal_fifo
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[3]: `ompi_rb_tree' is up to date.
make[3]: `opal_bitmap' is up to date.
make[3]: `opal_hash_table' is up to date.
make[3]: `opal_proc_table' is up to date.
make[3]: `opal_tree' is up to date.
make[3]: `opal_list' is up to date.
make[3]: `opal_value_array' is up to date.
make[3]: `opal_pointer_array' is up to date.
make[3]: `opal_lifo' is up to date.
make[3]: `opal_fifo' is up to date.
make[3]: Leaving directory `/tmp/build/openmpi-3.0.0/test/class'
make  check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[4]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'

I have to interrupt it; it has been running for many minutes, and these tests
do not usually behave this way.
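
In case it helps to narrow this down, the individual test binaries already
exist in that directory, so it should be possible to run them by hand and see
which one hangs, e.g.:

  $ cd /tmp/build/openmpi-3.0.0/test/class
  $ ./opal_lifo
  $ ./opal_fifo

(That is just a guess at where to start; the lifo/fifo tests are the last ones
listed before the hang.)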

-- bennet

On Mon, Jun 18, 2018 at 4:21 PM Bennet Fauber  wrote:
>
> No such luck.  If it matters, mpirun does seem to work with processes
> on the local node that have no internal MPI code.  That is,
>
> [bennet@cavium-hpc ~]$ mpirun -np 4 hello
> Hello, ARM
> Hello, ARM
> Hello, ARM
> Hello, ARM
>
> but it fails with a similar error if run while a SLURM job is active; i.e.,
>
> [bennet@cavium-hpc ~]$ mpirun hello
> --
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[26589,0],0] on node cavium-hpc
>   Remote daemon: [[26589,0],1] on node cav01
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --
>
> That makes sense, I guess.
>
> I'll keep you posted as to what happens with 3.0.0 and downgrading SLURM.
>
>
> Thanks,   -- bennet
>
>
> On Mon, Jun 18, 2018 at 4:05 PM r...@open-mpi.org  wrote:
> >
> > I doubt Slurm is the issue. For grins, lets try adding “--mca plm rsh” to 
> > your mpirun cmd line and see if that works.
> >
> >
> > > On Jun 18, 2018, at 12:57 PM, Bennet Fauber  wrote:
> > >
> > > To eliminate possibilities, I removed all other versions of OpenMPI
> > > from the system, and rebuilt using the same build script as was used
> > > to generate the prior report.
> > >
> > > [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> > > Checking compilers and things
> > > OMPI is ompi
> > > COMP_NAME is gcc_7_1_0
> > > SRC_ROOT is /sw/arcts/centos7/src
> > > PREFIX_ROOT is /sw/arcts/centos7
> > > PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > > CONFIGURE_FLAGS are
> > > COMPILERS are CC=gcc CXX=g++ FC=gfortran
> > >
> > > Currently Loaded Modules:
> > >  1) gcc/7.1.0
> > >
> > > gcc (ARM-build-14) 7.1.0
> > > Copyright (C) 2017 Free Software Foundation, Inc.
> > > This is free software; see the source for copying conditions.  There is NO
> > > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
> > > PURPOSE.
> > >
> > > Using the following configure command
> > >
> > > ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > >   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
> > > --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> > > --with-hwloc=external --with-slurm --disable-dlopen
> > > --enable-debug  CC=gcc CXX=g++ FC=gfortran
> > >
> > > The tar ball is
> > >
> > > 2e783873f6b206aa71f745762fa15da5
> > > /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
> > >
> > > I still get
> > >
> > > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > > salloc: Pending job allocation 165
> > > salloc: job 165 queued and waiting for resources
> > > salloc: job 165 has been allocated resources
> > > salloc: Granted job allocation 165
> > > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > > The sum = 0.866386
> > > Elapsed time is:  5.425549
> > > The sum = 0.866386
> > > Elapsed time is:  5.422826
> > > The sum = 0.866386
> > > Elapsed time is:  5.427676
> > > The sum = 0.866386
> > > Elapsed time is:  5.424928
> > > The sum = 0.866386
> > > Elapsed time is:  5.422060
> > > The sum = 0.866386
> > > Elapsed time is:  5.425431
> > > The sum = 0.866386
> > > Elapsed time is:  5.424350
> > > The sum = 0.866386
> > > Elapsed time is:  5.423037
> > > The sum = 0.866386
> > > Elapsed time is:  5.427727
> > > The sum = 0.866386
> > > Elapsed time is:  5.424922
> > > The sum = 0.866386
> > > Elapsed time is:  5.424279
> > > Total time is:  59.672992
> > >
> > > [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > > --
> > > An ORTE daemon has unexpectedly failed after launch and before
> > > communicating bac

Re: [OMPI users] srun works, mpirun does not

2018-06-18 Thread r...@open-mpi.org
This is on an ARM processor? I suspect that is the root of the problems, as we 
aren’t seeing anything like this elsewhere.


> On Jun 18, 2018, at 1:27 PM, Bennet Fauber  wrote:
> 
> If it's of any use, 3.0.0 seems to hang at
> 
> Making check in class
> make[2]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
> make  ompi_rb_tree opal_bitmap opal_hash_table opal_proc_table
> opal_tree opal_list opal_value_array opal_pointer_array opal_lifo
> opal_fifo
> make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
> make[3]: `ompi_rb_tree' is up to date.
> make[3]: `opal_bitmap' is up to date.
> make[3]: `opal_hash_table' is up to date.
> make[3]: `opal_proc_table' is up to date.
> make[3]: `opal_tree' is up to date.
> make[3]: `opal_list' is up to date.
> make[3]: `opal_value_array' is up to date.
> make[3]: `opal_pointer_array' is up to date.
> make[3]: `opal_lifo' is up to date.
> make[3]: `opal_fifo' is up to date.
> make[3]: Leaving directory `/tmp/build/openmpi-3.0.0/test/class'
> make  check-TESTS
> make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
> make[4]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
> 
> I have to interrupt it, but it's been many minutes, and usually these
> have not been behaving this way.
> 
> -- bennet
> 
> On Mon, Jun 18, 2018 at 4:21 PM Bennet Fauber  wrote:
>> 
>> No such luck.  If it matters, mpirun does seem to work with processes
>> on the local node that have no internal MPI code.  That is,
>> 
>> [bennet@cavium-hpc ~]$ mpirun -np 4 hello
>> Hello, ARM
>> Hello, ARM
>> Hello, ARM
>> Hello, ARM
>> 
>> but it fails with a similar error if run while a SLURM job is active; i.e.,
>> 
>> [bennet@cavium-hpc ~]$ mpirun hello
>> --
>> ORTE has lost communication with a remote daemon.
>> 
>>  HNP daemon   : [[26589,0],0] on node cavium-hpc
>>  Remote daemon: [[26589,0],1] on node cav01
>> 
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --
>> 
>> That makes sense, I guess.
>> 
>> I'll keep you posted as to what happens with 3.0.0 and downgrading SLURM.
>> 
>> 
>> Thanks,   -- bennet
>> 
>> 
>> On Mon, Jun 18, 2018 at 4:05 PM r...@open-mpi.org  wrote:
>>> 
>>> I doubt Slurm is the issue. For grins, lets try adding “--mca plm rsh” to 
>>> your mpirun cmd line and see if that works.
>>> 
>>> 
 On Jun 18, 2018, at 12:57 PM, Bennet Fauber  wrote:
 
 To eliminate possibilities, I removed all other versions of OpenMPI
 from the system, and rebuilt using the same build script as was used
 to generate the prior report.
 
 [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
 Checking compilers and things
 OMPI is ompi
 COMP_NAME is gcc_7_1_0
 SRC_ROOT is /sw/arcts/centos7/src
 PREFIX_ROOT is /sw/arcts/centos7
 PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
 CONFIGURE_FLAGS are
 COMPILERS are CC=gcc CXX=g++ FC=gfortran
 
 Currently Loaded Modules:
 1) gcc/7.1.0
 
 gcc (ARM-build-14) 7.1.0
 Copyright (C) 2017 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 Using the following configure command
 
 ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
  --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
 --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
 --with-hwloc=external --with-slurm --disable-dlopen
 --enable-debug  CC=gcc CXX=g++ FC=gfortran
 
 The tar ball is
 
 2e783873f6b206aa71f745762fa15da5
 /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
 
 I still get
 
 [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
 salloc: Pending job allocation 165
 salloc: job 165 queued and waiting for resources
 salloc: job 165 has been allocated resources
 salloc: Granted job allocation 165
 [bennet@cavium-hpc ~]$ srun ./test_mpi
 The sum = 0.866386
 Elapsed time is:  5.425549
 The sum = 0.866386
 Elapsed time is:  5.422826
 The sum = 0.866386
 Elapsed time is:  5.427676
 The sum = 0.866386
 Elapsed time is:  5.424928
 The sum = 0.866386
 Elapsed time is:  5.422060
 The sum = 0.866386
 Elapsed time is:  5.425431
 The sum = 0.866386
 Elapsed time is:  5.424350
 The sum = 0.866386
 Elapsed time is:  5.423037
 The sum = 0.866386
 Elapsed time is:  5.427727
 The sum = 0.866386
 Elapsed time is:  5.424922
 The sum = 0.866386
 Elapsed time is:  5.424279
 Total time is:  59.672992
 
>

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-18 Thread Ryan Novosielski
What MPI is SLURM set to use, and how was that compiled? Out of the box, the 
SLURM MPI is set to “none”, or it was the last time I checked, and so it isn’t 
necessarily doing MPI. Now, I did try this with OpenMPI 2.1.1 and it looked 
right either way (OpenMPI built with “--with-pmi”), but for MVAPICH2 this 
definitely made a difference:

[novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 
./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
processors
[slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus 
error (signal 7)
srun: error: slepner032: task 10: Bus error

[novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 
./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 
processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 
processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 
processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 
processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 
processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 
processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 
processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 
processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 
processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 
processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 
processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 
processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 
processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 
processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 
processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 
processors
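
As an aside, the Slurm side of this can be checked with something like:

  $ scontrol show config | grep MpiDefault
  $ srun --mpi=list

to see what the default MPI plugin is and which ones are available.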

> On Jun 17, 2018, at 5:51 PM, Bennet Fauber  wrote:
> 
> I rebuilt with --enable-debug, then ran with
> 
> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> salloc: Pending job allocation 158
> salloc: job 158 queued and waiting for resources
> salloc: job 158 has been allocated resources
> salloc: Granted job allocation 158
> 
> [bennet@cavium-hpc ~]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  5.426759
> The sum = 0.866386
> Elapsed time is:  5.424068
> The sum = 0.866386
> Elapsed time is:  5.426195
> The sum = 0.866386
> Elapsed time is:  5.426059
> The sum = 0.866386
> Elapsed time is:  5.423192
> The sum = 0.866386
> Elapsed time is:  5.426252
> The sum = 0.866386
> Elapsed time is:  5.425444
> The sum = 0.866386
> Elapsed time is:  5.423647
> The sum = 0.866386
> Elapsed time is:  5.426082
> The sum = 0.866386
> Elapsed time is:  5.425936
> The sum = 0.866386
> Elapsed time is:  5.423964
> Total time is:  59.677830
> 
> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> 2>&1 | tee debug2.log
> 
> The zipped debug log should be attached.
> 
> I did that after using systemctl to turn off the firewall on the login
> node from which the mpirun is executed, as well as on the host on
> which it runs.
> 
> [bennet@cavium-hpc ~]$ mpirun hostname
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of

[OMPI users] Enforcing specific interface and subnet usage

2018-06-18 Thread Maksym Planeta
Hello,

I want to force OpenMPI to use TCP and, in particular, to use a specific subnet. 
Unfortunately, I can't manage to do that.

Here is what I try:

$BIN/mpirun --mca pml ob1 --mca btl tcp,self --mca ptl_tcp_remote_connections 1 
--mca btl_tcp_if_include '10.233.0.0/19' -np 4  --oversubscribe -H ib1n,ib2n 
bash -c 'echo $PMIX_SERVER_URI2'

The expected result would be a list of IP addresses in the 10.233.0.0 subnet, 
but instead I get this:

2659516416.2;tcp4://127.0.0.1:46777
2659516416.2;tcp4://127.0.0.1:46777
2659516416.1;tcp4://127.0.0.1:45055
2659516416.1;tcp4://127.0.0.1:45055

Could you help me to debug this problem somehow?

The IP addresses are completely available in the desired subnet

$BIN/mpirun --mca pml ob1 --mca btl tcp,self  --mca ptl_tcp_remote_connections 
1 --mca btl_tcp_if_include '10.233.0.0/19' -np 4  --oversubscribe -H ib1n,ib2n 
ip addr show dev br0

Returns a set of bridges looking like:

9: br0:  mtu 1500 qdisc noqueue state UP group 
default qlen 1000
link/ether 94:de:80:ba:37:e4 brd ff:ff:ff:ff:ff:ff
inet 141.76.49.17/26 brd 141.76.49.63 scope global br0
   valid_lft forever preferred_lft forever
inet 10.233.0.82/19 scope global br0
   valid_lft forever preferred_lft forever
inet6 2002:8d4c:3001:48:40de:80ff:feba:37e4/64 scope global deprecated 
mngtmpaddr dynamic 
   valid_lft 59528sec preferred_lft 0sec
inet6 fe80::96de:80ff:feba:37e4/64 scope link tentative dadfailed 
   valid_lft forever preferred_lft forever


What is more puzzling is that if I attach a debugger at 
opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_components.c, around line 500, 
I see that mca_ptl_tcp_component.remote_connections is false. This means that 
the component parameters I set are being ignored.

-- 
Regards,
Maksym Planeta




Re: [OMPI users] Enforcing specific interface and subnet usage

2018-06-18 Thread r...@open-mpi.org
I’m not entirely sure I understand what you are trying to do. The 
PMIX_SERVER_URI2 envar tells local clients how to connect to their local PMIx 
server (i.e., the OMPI daemon on that node). This is always done over the 
loopback device since it is a purely local connection that is never used for 
MPI messages.

I’m sure that the tcp BTL is using your indicated subnet, as that is what would 
be used for internode messages.
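
If you want to confirm which interface the TCP BTL itself selects for
internode traffic, one rough check (a sketch, not verified on your setup)
is to raise the BTL verbosity and watch the interface-selection messages:

  $BIN/mpirun --mca pml ob1 --mca btl tcp,self \
      --mca btl_tcp_if_include '10.233.0.0/19' \
      --mca btl_base_verbose 30 \
      -np 2 -H ib1n,ib2n ./your_mpi_test

Here ./your_mpi_test stands in for any small MPI program that actually sends
messages between the nodes.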


> On Jun 18, 2018, at 3:52 PM, Maksym Planeta  
> wrote:
> 
> Hello,
> 
> I want to force OpenMPI to use TCP and in particular use a particular subnet. 
> Unfortunately, I can't manage to do that.
> 
> Here is what I try:
> 
> $BIN/mpirun --mca pml ob1 --mca btl tcp,self --mca ptl_tcp_remote_connections 
> 1 --mca btl_tcp_if_include '10.233.0.0/19' -np 4  --oversubscribe -H 
> ib1n,ib2n bash -c 'echo $PMIX_SERVER_URI2'
> 
> The expected result would be a list of IP addresses in 10.233.0.0 subnet, but 
> instead I get this:
> 
> 2659516416.2;tcp4://127.0.0.1:46777
> 2659516416.2;tcp4://127.0.0.1:46777
> 2659516416.1;tcp4://127.0.0.1:45055
> 2659516416.1;tcp4://127.0.0.1:45055
> 
> Could you help me to debug this problem somehow?
> 
> The IP addresses are completely available in the desired subnet
> 
> $BIN/mpirun --mca pml ob1 --mca btl tcp,self  --mca 
> ptl_tcp_remote_connections 1 --mca btl_tcp_if_include '10.233.0.0/19' -np 4  
> --oversubscribe -H ib1n,ib2n ip addr show dev br0
> 
> Returns a set of bridges looking like:
> 
> 9: br0:  mtu 1500 qdisc noqueue state UP 
> group default qlen 1000
>link/ether 94:de:80:ba:37:e4 brd ff:ff:ff:ff:ff:ff
>inet 141.76.49.17/26 brd 141.76.49.63 scope global br0
>   valid_lft forever preferred_lft forever
>inet 10.233.0.82/19 scope global br0
>   valid_lft forever preferred_lft forever
>inet6 2002:8d4c:3001:48:40de:80ff:feba:37e4/64 scope global deprecated 
> mngtmpaddr dynamic 
>   valid_lft 59528sec preferred_lft 0sec
>inet6 fe80::96de:80ff:feba:37e4/64 scope link tentative dadfailed 
>   valid_lft forever preferred_lft forever
> 
> 
> What is more boggling is that if I attache with a debugger at 
> opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_components.c around line 
> 500 I see that mca_ptl_tcp_component.remote_connections is false. This means 
> that the way I set up component parameters is ignored.
> 
> -- 
> Regards,
> Maksym Planeta
> 

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-18 Thread Bennet Fauber
Ryan,

With srun it's fine.  Only with mpirun is there a problem, and that is
both on a single node and on multiple nodes.  SLURM was built against
pmix 2.0.2, and I am pretty sure that SLURM's default is pmix.  We are
running a recent patch of SLURM, I think.  SLURM and OMPI are both
being built using the same installation of pmix.
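
One way to sanity-check the OMPI side is to see which PMIx support ompi_info
reports, e.g.:

  [bennet@cavium-hpc ~]$ ompi_info | grep -i pmix

which should list the PMIx component(s) that were actually built in.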

[bennet@cavium-hpc etc]$ srun --version
slurm 17.11.7

[bennet@cavium-hpc etc]$ grep pmi slurm.conf
MpiDefault=pmix

[bennet@cavium-hpc pmix]$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix

I think I said that I was pretty sure I had got this to work with both
mpirun and srun at one point, but I am unable to find the magic a
second time.




On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski  wrote:
>
> What MPI is SLURM set to use/how was that compiled? Out of the box, the SLURM 
> MPI is set to “none”, or was last I checked, and so isn’t necessarily doing 
> MPI. Now, I did try this with OpenMPI 2.1.1 and it looked right either way 
> (OpenMPI built with “--with-pmi"), but for MVAPICH2 this definitely made a 
> difference:
>
> [novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 
> ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
> processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
> processors
> [slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: 
> Bus error (signal 7)
> srun: error: slepner032: task 10: Bus error
>
> [novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 
> ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 
> processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 
> processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 
> processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 
> processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 
> processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 
> processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 
> processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 
> processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 
> processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 
> processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 
> processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 
> processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 
> processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 
> processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 
> processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 
> processors
>
> > On Jun 17, 2018, at 5:51 PM, Bennet Fauber  wrote:
> >
> > I rebuilt with --enable-debug, then ran with
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 158
> > salloc: job 158 queued and waiting for resources
> > salloc: job 158 has been allocated resources
> > salloc: Granted job allocation 158
> >
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is:  5.426759
> > The sum = 0.866386
> > Elapsed time is:  5.424068
> > The sum = 0.866386
> > Elapsed time is:  5.426195
> > The sum = 0.866386
> > Elapsed time is:  5.426059
> > The sum = 0.866386
> > Elapsed time is:  5

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-18 Thread Bennet Fauber
Well, this is kind of interesting.  I can strip the configure line
back and get mpirun to work on one node, but then neither srun nor
mpirun within a SLURM job will run.  I can add back configure options
to get to

./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-pmix=/opt/pmix/2.0.2 \
--with-slurm

and the situation does not seem to change.  Then I add libevent,

./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-pmix=/opt/pmix/2.0.2 \
--with-libevent=external \
--with-slurm

and it works again with srun but fails to run the binary with mpirun.

It is late, and I am baffled.
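
If it helps, one rough way to compare the two builds would be to look at what
configure detected for libevent in each build tree, e.g.:

  $ grep -i libevent config.log | head

run from each build directory, to see whether the external libevent was
actually picked up.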

On Mon, Jun 18, 2018 at 9:02 PM Bennet Fauber  wrote:
>
> Ryan,
>
> With srun it's fine.  Only with mpirun is there a problem, and that is
> both on a single node and on multiple nodes.  SLURM was built against
> pmix 2.0.2, and I am pretty sure that SLURM's default is pmix.  We are
> running a recent patch of SLURM, I think.  SLURM and OMPI are both
> being built using the same installation of pmix.
>
> [bennet@cavium-hpc etc]$ srun --version
> slurm 17.11.7
>
> [bennet@cavium-hpc etc]$ grep pmi slurm.conf
> MpiDefault=pmix
>
> [bennet@cavium-hpc pmix]$ srun --mpi=list
> srun: MPI types are...
> srun: pmix_v2
> srun: openmpi
> srun: none
> srun: pmi2
> srun: pmix
>
> I think I said that I was pretty sure I had got this to work with both
> mpirun and srun at one point, but I am unable to find the magic a
> second time.
>
>
>
>
> On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski  wrote:
> >
> > What MPI is SLURM set to use/how was that compiled? Out of the box, the 
> > SLURM MPI is set to “none”, or was last I checked, and so isn’t necessarily 
> > doing MPI. Now, I did try this with OpenMPI 2.1.1 and it looked right 
> > either way (OpenMPI built with “--with-pmi"), but for MVAPICH2 this 
> > definitely made a difference:
> >
> > [novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 
> > ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 
> > processors
> > [slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: 
> > Bus error (signal 7)
> > srun: error: slepner032: task 10: Bus error
> >
> > [novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 
> > ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 
> > processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 
> > processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 
> > processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 
> > processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 
> > processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 
> > processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 
> > processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 
> > processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 
> > processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 
> > processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 
> > processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 
> > processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 
> > processors
> > Hello world from pr