No such luck. If it matters, mpirun does seem to work with processes on the local node that have no internal MPI code. That is,
[bennet@cavium-hpc ~]$ mpirun -np 4 hello
Hello, ARM
Hello, ARM
Hello, ARM
Hello, ARM

but it fails with a similar error if run while a SLURM job is active; i.e.,

[bennet@cavium-hpc ~]$ mpirun hello
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[26589,0],0] on node cavium-hpc
  Remote daemon: [[26589,0],1] on node cav01

This is usually due to either a failure of the TCP network connection
to the node, or possibly an internal failure of the daemon itself. We
cannot recover from this failure, and therefore will terminate the job.
--------------------------------------------------------------------------

That makes sense, I guess.

I'll keep you posted as to what happens with 3.0.0 and with downgrading SLURM.

Thanks,

-- bennet

On Mon, Jun 18, 2018 at 4:05 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
>
> I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to
> your mpirun cmd line and see if that works.
>
>
> > On Jun 18, 2018, at 12:57 PM, Bennet Fauber <ben...@umich.edu> wrote:
> >
> > To eliminate possibilities, I removed all other versions of OpenMPI
> > from the system and rebuilt using the same build script that was used
> > to generate the prior report.
> >
> > [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> > Checking compilers and things
> > OMPI is ompi
> > COMP_NAME is gcc_7_1_0
> > SRC_ROOT is /sw/arcts/centos7/src
> > PREFIX_ROOT is /sw/arcts/centos7
> > PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > CONFIGURE_FLAGS are
> > COMPILERS are CC=gcc CXX=g++ FC=gfortran
> >
> > Currently Loaded Modules:
> >   1) gcc/7.1.0
> >
> > gcc (ARM-build-14) 7.1.0
> > Copyright (C) 2017 Free Software Foundation, Inc.
> > This is free software; see the source for copying conditions.  There is NO
> > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> >
> > Using the following configure command
> >
> > ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> >     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
> >     --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >     --with-hwloc=external --with-slurm --disable-dlopen
> >     --enable-debug CC=gcc CXX=g++ FC=gfortran
> >
> > The tar ball is
> >
> > 2e783873f6b206aa71f745762fa15da5  /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
> >
> > I still get
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 165
> > salloc: job 165 queued and waiting for resources
> > salloc: job 165 has been allocated resources
> > salloc: Granted job allocation 165
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is: 5.425549
> > [ the same sum, with times between 5.42 and 5.43, repeated 10 more times ]
> > Total time is: 59.672992
> >
> > [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > I reran with
> >
> > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug3.log
> >
> > and the gzipped log is attached.
> >
> > I thought to try it with a different test program, which spits out this error:
> >
> > [cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
> > found in file base/ess_base_std_app.c at line 219
> > [cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
> > found in file base/ess_base_std_app.c at line 219
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems. This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> >   store DAEMON URI failed
> >   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> >
> > I am almost certain that, at one point, OMPI mpirun did work, and I am
> > at a loss to explain why it no longer does.
> >
> > I have also tried the 3.1.1rc1 version. I am now going to try 3.0.0,
> > and we'll try downgrading SLURM to a prior version.
> >
> > -- bennet
> >
> >
> > On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >>
> >> Hmmm... well, the error has changed from your initial report. Turning off
> >> the firewall was the solution to that problem.
> >>
> >> This problem is different - it isn’t the orted that failed in the log you
> >> sent, but the application proc that couldn’t initialize. It looks like
> >> that app was compiled against some earlier version of OMPI? It is looking
> >> for something that no longer exists. I saw that you compiled it with a
> >> simple “gcc” instead of our wrapper compiler “mpicc” - any particular
> >> reason? My guess is that your compile picked up some older version of OMPI
> >> on the system.
> >>
> >> Ralph
> >>
> >>
> >>> On Jun 17, 2018, at 2:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> >>>
> >>> I rebuilt with --enable-debug, then ran with
> >>>
> >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> >>> salloc: Pending job allocation 158
> >>> salloc: job 158 queued and waiting for resources
> >>> salloc: job 158 has been allocated resources
> >>> salloc: Granted job allocation 158
> >>>
> >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.426759
> >>> [ the same sum, with times between 5.42 and 5.43, repeated 10 more times ]
> >>> Total time is: 59.677830
> >>>
> >>> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
> >>>
> >>> The zipped debug log should be attached.
> >>>
> >>> I did that after using systemctl to turn off the firewall on the login
> >>> node from which the mpirun is executed, as well as on the host on
> >>> which it runs.
> >>>
> >>> [bennet@cavium-hpc ~]$ mpirun hostname
> >>> --------------------------------------------------------------------------
> >>> An ORTE daemon has unexpectedly failed after launch and before
> >>> communicating back to mpirun. This could be caused by a number
> >>> of factors, including an inability to create a connection back
> >>> to mpirun due to a lack of common network interfaces and/or no
> >>> route found between them. Please check network connectivity
> >>> (including firewalls and network routing requirements).
> >>> --------------------------------------------------------------------------
> >>>
> >>> [bennet@cavium-hpc ~]$ squeue
> >>>   JOBID PARTITION  NAME   USER ST   TIME NODES NODELIST(REASON)
> >>>     158  standard  bash bennet  R  14:30     1 cav01
> >>> [bennet@cavium-hpc ~]$ srun hostname
> >>> cav01.arc-ts.umich.edu
> >>> [ repeated 23 more times ]
> >>>
> >>> As always, your help is much appreciated,
> >>>
> >>> -- bennet
> >>>
> >>> On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >>>>
> >>>> Add --enable-debug to your OMPI configure cmd line, and then add --mca
> >>>> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote
> >>>> daemon isn’t starting - this will give you some info as to why.
> >>>>
> >>>>
> >>>>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >>>>>
> >>>>> I have a compiled binary that will run with srun but not with mpirun.
> >>>>> The attempts to run with mpirun all result in failures to initialize.
> >>>>> I have tried this on one node, and on two nodes, with the firewall
> >>>>> turned on and with it off.
> >>>>>
> >>>>> Am I missing some command line option for mpirun?
> >>>>>
> >>>>> OMPI was built from this configure command:
> >>>>>
> >>>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> >>>>>     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> >>>>>     --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >>>>>     --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> >>>>>     FC=gfortran
> >>>>>
> >>>>> All tests from `make check` passed, see below.
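One cheap check related to the “gcc vs. mpicc” question in this thread: `ldd` shows which MPI library, if any, a binary is dynamically linked against. A sketch (the `/bin/ls` default is only there so the commands run anywhere for illustration; point it at `./test_mpi` in practice):

```shell
#!/bin/sh
# Report whether a binary is dynamically linked against an MPI library.
# Usage: ./check_mpi_link.sh ./test_mpi
BIN="${1:-/bin/ls}"   # /bin/ls is a harmless illustrative default

if ldd "$BIN" 2>/dev/null | grep 'libmpi'; then
    echo "found libmpi linkage in $BIN (paths above)"
else
    echo "no libmpi linkage found in $BIN"
fi
```

A binary built through the `mpicc` wrapper should show `libmpi` resolving into the intended 3.1.0-b tree; one built with a bare `gcc`, or linked against a stale install, would show nothing or the wrong path, which bears on Ralph's guess above about which OMPI the app picked up.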
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ mpicc --show
> >>>>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> >>>>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> >>>>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> >>>>> -Wl,--enable-new-dtags
> >>>>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> >>>>>
> >>>>> The test_mpi was compiled with
> >>>>>
> >>>>> $ gcc -o test_mpi test_mpi.c -lm
> >>>>>
> >>>>> This is the runtime library path:
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> >>>>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> >>>>>
> >>>>> These commands are given in the exact sequence in which they were
> >>>>> entered at a console.
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> >>>>> salloc: Pending job allocation 156
> >>>>> salloc: job 156 queued and waiting for resources
> >>>>> salloc: job 156 has been allocated resources
> >>>>> salloc: Granted job allocation 156
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> >>>>> --------------------------------------------------------------------------
> >>>>> An ORTE daemon has unexpectedly failed after launch and before
> >>>>> communicating back to mpirun. This could be caused by a number
> >>>>> of factors, including an inability to create a connection back
> >>>>> to mpirun due to a lack of common network interfaces and/or no
> >>>>> route found between them. Please check network connectivity
> >>>>> (including firewalls and network routing requirements).
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is: 5.425439
> >>>>> [ the same sum, with times between 5.42 and 5.43, repeated 10 more times ]
> >>>>> Total time is: 59.668622
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> -- bennet
> >>>>>
> >>>>>
> >>>>> make check results
> >>>>> ----------------------------------------------
> >>>>>
> >>>>> make  check-TESTS
> >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>>>> PASS: predefined_gap_test
> >>>>> PASS: predefined_pad_test
> >>>>> SKIP: dlopen_test
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 3
> >>>>> # PASS:  2
> >>>>> # SKIP:  1
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> PASS: atomic_cmpset_noinline
> >>>>>    - 5 threads: Passed
> >>>>> PASS: atomic_cmpset_noinline
> >>>>>    - 8 threads: Passed
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 8
> >>>>> # PASS:  8
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> >>>>> PASS: ompi_rb_tree
> >>>>> PASS: opal_bitmap
> >>>>> PASS: opal_hash_table
> >>>>> PASS: opal_proc_table
> >>>>> PASS: opal_tree
> >>>>> PASS: opal_list
> >>>>> PASS: opal_value_array
> >>>>> PASS: opal_pointer_array
> >>>>> PASS: opal_lifo
> >>>>> PASS: opal_fifo
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 10
> >>>>> # PASS:  10
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make  opal_thread opal_condition
> >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>>>   CC       opal_thread.o
> >>>>>   CCLD     opal_thread
> >>>>>   CC       opal_condition.o
> >>>>>   CCLD     opal_condition
> >>>>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>>> make  check-TESTS
> >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 0
> >>>>> # PASS:  0
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> >>>>> PASS: opal_datatype_test
> >>>>> PASS: unpack_hetero
> >>>>> PASS: checksum
> >>>>> PASS: position
> >>>>> PASS: position_noncontig
> >>>>> PASS: ddt_test
> >>>>> PASS: ddt_raw
> >>>>> PASS: unpack_ooo
> >>>>> PASS: ddt_pack
> >>>>> PASS: external32
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 10
> >>>>> # PASS:  10
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> >>>>> PASS: opal_bit_ops
> >>>>> PASS: opal_path_nfs
> >>>>> PASS: bipartite_graph
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 3
> >>>>> # PASS:  3
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> >>>>> PASS: dss_buffer
> >>>>> PASS: dss_cmp
> >>>>> PASS: dss_payload
> >>>>> PASS: dss_print
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 4
> >>>>> # PASS:  4
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> users@lists.open-mpi.org
> >>>>> https://lists.open-mpi.org/mailman/listinfo/users
> >>>
> >>> <debug2.log.gz>
> >
> > <debug3.log.gz>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users