[OMPI users] Fwd: srun works, mpirun does not

2018-06-17 Thread Bennet Fauber
I have a compiled binary that will run with srun but not with mpirun.
The attempts to run with mpirun all result in failures to initialize.
I have tried this on one node, and on two nodes, with firewall turned
on and with it off.

Am I missing some command line option for mpirun?

OMPI was built with this configure command

  $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
--mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
FC=gfortran

All tests from `make check` passed; see below.

[bennet@cavium-hpc ~]$ mpicc --show
gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
-L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
-Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
-Wl,--enable-new-dtags
-L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi

The test_mpi program was compiled with

$ gcc -o test_mpi test_mpi.c -lm
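
(For reference only: the actual test_mpi.c is not included in this thread.
A minimal, hypothetical sketch of a program that would produce output of the
shape shown below -- a per-task sum, a per-task elapsed time, and a single
"Total time" line -- might look like the following; a program like this would
be built with mpicc rather than plain gcc.)

    /* Hypothetical test_mpi.c -- illustration only, not the source
     * actually used in this thread.  Each rank does some numerical
     * work, times it with MPI_Wtime, and rank 0 prints the reduced
     * total of the elapsed times. */
    #include <math.h>
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();

        /* Per-rank numerical work: a slowly converging series. */
        double sum = 0.0;
        for (long i = 1; i <= 100000000L; i++)
            sum += sin((double)i) / (double)i;

        double elapsed = MPI_Wtime() - t0;
        printf("The sum = %f\n", sum);
        printf("Elapsed time is:  %f\n", elapsed);

        /* Sum the per-rank elapsed times onto rank 0. */
        double total = 0.0;
        MPI_Reduce(&elapsed, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("Total time is:  %f\n", total);

        MPI_Finalize();
        return 0;
    }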

This is the runtime library path

[bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
/opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib


These commands are given in the exact sequence in which they were entered
at a console.

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 156
salloc: job 156 queued and waiting for resources
salloc: job 156 has been allocated resources
salloc: Granted job allocation 156

[bennet@cavium-hpc ~]$ mpirun ./test_mpi
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  5.425439
The sum = 0.866386
Elapsed time is:  5.427427
The sum = 0.866386
Elapsed time is:  5.422579
The sum = 0.866386
Elapsed time is:  5.424168
The sum = 0.866386
Elapsed time is:  5.423951
The sum = 0.866386
Elapsed time is:  5.422414
The sum = 0.866386
Elapsed time is:  5.427156
The sum = 0.866386
Elapsed time is:  5.424834
The sum = 0.866386
Elapsed time is:  5.425103
The sum = 0.866386
Elapsed time is:  5.422415
The sum = 0.866386
Elapsed time is:  5.422948
Total time is:  59.668622

Thanks,

-- bennet


make check results
--

make  check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
PASS: predefined_gap_test
PASS: predefined_pad_test
SKIP: dlopen_test

Testsuite summary for Open MPI 3.1.0

# TOTAL: 3
# PASS:  2
# SKIP:  1
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[ elided ]
PASS: atomic_cmpset_noinline
- 5 threads: Passed
PASS: atomic_cmpset_noinline
- 8 threads: Passed

Testsuite summary for Open MPI 3.1.0

# TOTAL: 8
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[ elided ]
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo
PASS: opal_fifo

Testsuite summary for Open MPI 3.1.0

# TOTAL: 10
# PASS:  10
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[ elided ]
make  opal_thread opal_condition
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
  CC   opal_thread.o
  CCLD opal_thread
  CC   opal_condition.o
  CCLD opal_condition
make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
make  check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
===

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-17 Thread r...@open-mpi.org
Add --enable-debug to your OMPI configure cmd line, and then add --mca 
plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote daemon 
isn’t starting - this will give you some info as to why.


> On Jun 17, 2018, at 9:07 AM, Bennet Fauber  wrote:
> [ quoted text elided; full message above ]

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-17 Thread Bennet Fauber
I rebuilt with --enable-debug, then ran with

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 158
salloc: job 158 queued and waiting for resources
salloc: job 158 has been allocated resources
salloc: Granted job allocation 158

[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  5.426759
The sum = 0.866386
Elapsed time is:  5.424068
The sum = 0.866386
Elapsed time is:  5.426195
The sum = 0.866386
Elapsed time is:  5.426059
The sum = 0.866386
Elapsed time is:  5.423192
The sum = 0.866386
Elapsed time is:  5.426252
The sum = 0.866386
Elapsed time is:  5.425444
The sum = 0.866386
Elapsed time is:  5.423647
The sum = 0.866386
Elapsed time is:  5.426082
The sum = 0.866386
Elapsed time is:  5.425936
The sum = 0.866386
Elapsed time is:  5.423964
Total time is:  59.677830

[bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log

The zipped debug log should be attached.

I did that after using systemctl to turn off the firewall on the login
node from which the mpirun is executed, as well as on the host on
which it runs.

[bennet@cavium-hpc ~]$ mpirun hostname
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

[bennet@cavium-hpc ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               158  standard     bash   bennet  R      14:30      1 cav01
[bennet@cavium-hpc ~]$ srun hostname
cav01.arc-ts.umich.edu
[ repeated 23 more times ]

As always, your help is much appreciated,

-- bennet

On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org  wrote:
>
> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
> daemon isn’t starting - this will give you some info as to why.
>
>
> > On Jun 17, 2018, at 9:07 AM, Bennet Fauber  wrote:
> > [ quoted text elided; full message above ]