If it's of any use, 3.0.0 seems to hang at

Making check in class
make[2]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make  ompi_rb_tree opal_bitmap opal_hash_table opal_proc_table opal_tree opal_list opal_value_array opal_pointer_array opal_lifo opal_fifo
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[3]: `ompi_rb_tree' is up to date.
make[3]: `opal_bitmap' is up to date.
make[3]: `opal_hash_table' is up to date.
make[3]: `opal_proc_table' is up to date.
make[3]: `opal_tree' is up to date.
make[3]: `opal_list' is up to date.
make[3]: `opal_value_array' is up to date.
make[3]: `opal_pointer_array' is up to date.
make[3]: `opal_lifo' is up to date.
make[3]: `opal_fifo' is up to date.
make[3]: Leaving directory `/tmp/build/openmpi-3.0.0/test/class'
make  check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[4]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
I have to interrupt it; it has been running for many minutes, and these tests do not normally behave this way.

-- bennet

On Mon, Jun 18, 2018 at 4:21 PM Bennet Fauber <ben...@umich.edu> wrote:
>
> No such luck.  If it matters, mpirun does seem to work with processes
> on the local node that have no internal MPI code.  That is,
>
> [bennet@cavium-hpc ~]$ mpirun -np 4 hello
> Hello, ARM
> Hello, ARM
> Hello, ARM
> Hello, ARM
>
> but it fails with a similar error if run while a SLURM job is active; i.e.,
>
> [bennet@cavium-hpc ~]$ mpirun hello
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[26589,0],0] on node cavium-hpc
>   Remote daemon: [[26589,0],1] on node cav01
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
>
> That makes sense, I guess.
>
> I'll keep you posted as to what happens with 3.0.0 and downgrading SLURM.
>
> Thanks,    -- bennet
>
> On Mon, Jun 18, 2018 at 4:05 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >
> > I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to
> > your mpirun cmd line and see if that works.
> >
> > > On Jun 18, 2018, at 12:57 PM, Bennet Fauber <ben...@umich.edu> wrote:
> > >
> > > To eliminate possibilities, I removed all other versions of OpenMPI
> > > from the system, and rebuilt using the same build script as was used
> > > to generate the prior report.
> > >
> > > [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> > > Checking compilers and things
> > > OMPI is ompi
> > > COMP_NAME is gcc_7_1_0
> > > SRC_ROOT is /sw/arcts/centos7/src
> > > PREFIX_ROOT is /sw/arcts/centos7
> > > PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > > CONFIGURE_FLAGS are
> > > COMPILERS are CC=gcc CXX=g++ FC=gfortran
> > >
> > > Currently Loaded Modules:
> > >   1) gcc/7.1.0
> > >
> > > gcc (ARM-build-14) 7.1.0
> > > Copyright (C) 2017 Free Software Foundation, Inc.
> > > This is free software; see the source for copying conditions.  There is NO
> > > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> > >
> > > Using the following configure command
> > >
> > > ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > >     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
> > >     --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> > >     --with-hwloc=external --with-slurm --disable-dlopen
> > >     --enable-debug CC=gcc CXX=g++ FC=gfortran
> > >
> > > The tar ball is
> > >
> > > 2e783873f6b206aa71f745762fa15da5  /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
> > >
> > > I still get
> > >
> > > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > > salloc: Pending job allocation 165
> > > salloc: job 165 queued and waiting for resources
> > > salloc: job 165 has been allocated resources
> > > salloc: Granted job allocation 165
> > > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > > The sum = 0.866386
> > > Elapsed time is: 5.425549
> > > The sum = 0.866386
> > > Elapsed time is: 5.422826
> > > The sum = 0.866386
> > > Elapsed time is: 5.427676
> > > The sum = 0.866386
> > > Elapsed time is: 5.424928
> > > The sum = 0.866386
> > > Elapsed time is: 5.422060
> > > The sum = 0.866386
> > > Elapsed time is: 5.425431
> > > The sum = 0.866386
> > > Elapsed time is: 5.424350
> > > The sum = 0.866386
> > > Elapsed time is: 5.423037
> > > The sum = 0.866386
> > > Elapsed time is: 5.427727
> > > The sum = 0.866386
> > > Elapsed time is: 5.424922
> > > The sum = 0.866386
> > > Elapsed time is: 5.424279
> > > Total time is: 59.672992
> > >
> > > [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > > --------------------------------------------------------------------------
> > > An ORTE daemon has unexpectedly failed after launch and before
> > > communicating back to mpirun. This could be caused by a number
> > > of factors, including an inability to create a connection back
> > > to mpirun due to a lack of common network interfaces and/or no
> > > route found between them. Please check network connectivity
> > > (including firewalls and network routing requirements).
> > > --------------------------------------------------------------------------
> > >
> > > I reran with
> > >
> > > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug3.log
> > >
> > > and the gzipped log is attached.
> > >
> > > I also tried it with a different test program, which reports the error
> > >
> > > [cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
> > > found in file base/ess_base_std_app.c at line 219
> > > [cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
> > > found in file base/ess_base_std_app.c at line 219
> > > --------------------------------------------------------------------------
> > > It looks like orte_init failed for some reason; your parallel process is
> > > likely to abort. There are many reasons that a parallel process can
> > > fail during orte_init; some of which are due to configuration or
> > > environment problems. This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > >   store DAEMON URI failed
> > >   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> > >
> > > At one point, I am almost certain that OMPI mpirun did work, and I am
> > > at a loss to explain why it no longer does.
> > >
> > > I have also tried the 3.1.1rc1 version.  I am now going to try 3.0.0,
> > > and we'll try downgrading SLURM to a prior version.
> > >
> > > -- bennet
> > >
> > > On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org <r...@open-mpi.org> wrote:
> > >>
> > >> Hmmm...well, the error has changed from your initial report. Turning off
> > >> the firewall was the solution to that problem.
> > >>
> > >> This problem is different - it isn’t the orted that failed in the log
> > >> you sent, but the application proc that couldn’t initialize. It looks
> > >> like that app was compiled against some earlier version of OMPI? It is
> > >> looking for something that no longer exists. I saw that you compiled it
> > >> with a simple “gcc” instead of our wrapper compiler “mpicc” - any
> > >> particular reason? My guess is that your compile picked up some older
> > >> version of OMPI on the system.
> > >>
> > >> Ralph
> > >>
> > >>> On Jun 17, 2018, at 2:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> > >>>
> > >>> I rebuilt with --enable-debug, then ran with
> > >>>
> > >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > >>> salloc: Pending job allocation 158
> > >>> salloc: job 158 queued and waiting for resources
> > >>> salloc: job 158 has been allocated resources
> > >>> salloc: Granted job allocation 158
> > >>>
> > >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.426759
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.424068
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.426195
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.426059
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.423192
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.426252
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.425444
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.423647
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.426082
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.425936
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.423964
> > >>> Total time is: 59.677830
> > >>>
> > >>> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
> > >>>
> > >>> The zipped debug log should be attached.
> > >>>
> > >>> I did that after using systemctl to turn off the firewall on the login
> > >>> node from which the mpirun is executed, as well as on the host on
> > >>> which it runs.
> > >>>
> > >>> [bennet@cavium-hpc ~]$ mpirun hostname
> > >>> --------------------------------------------------------------------------
> > >>> An ORTE daemon has unexpectedly failed after launch and before
> > >>> communicating back to mpirun. This could be caused by a number
> > >>> of factors, including an inability to create a connection back
> > >>> to mpirun due to a lack of common network interfaces and/or no
> > >>> route found between them. Please check network connectivity
> > >>> (including firewalls and network routing requirements).
> > >>> --------------------------------------------------------------------------
> > >>>
> > >>> [bennet@cavium-hpc ~]$ squeue
> > >>>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> > >>>     158  standard     bash   bennet  R      14:30      1 cav01
> > >>> [bennet@cavium-hpc ~]$ srun hostname
> > >>> cav01.arc-ts.umich.edu
> > >>> [ repeated 23 more times ]
> > >>>
> > >>> As always, your help is much appreciated,
> > >>>
> > >>> -- bennet
> > >>>
> > >>> On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> > >>>>
> > >>>> Add --enable-debug to your OMPI configure cmd line, and then add --mca
> > >>>> plm_base_verbose 10 to your mpirun cmd line. For some reason, the
> > >>>> remote daemon isn’t starting - this will give you some info as to why.
> > >>>>
> > >>>>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> > >>>>>
> > >>>>> I have a compiled binary that will run with srun but not with mpirun.
> > >>>>> The attempts to run with mpirun all result in failures to initialize.
> > >>>>> I have tried this on one node and on two nodes, with the firewall turned
> > >>>>> on and with it off.
> > >>>>>
> > >>>>> Am I missing some command line option for mpirun?
> > >>>>>
> > >>>>> OMPI was built with this configure command
> > >>>>>
> > >>>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> > >>>>>     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> > >>>>>     --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> > >>>>>     --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> > >>>>>     FC=gfortran
> > >>>>>
> > >>>>> All tests from `make check` passed, see below.
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ mpicc --show
> > >>>>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> > >>>>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> > >>>>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> > >>>>> -Wl,--enable-new-dtags
> > >>>>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> > >>>>>
> > >>>>> The test_mpi binary was compiled with
> > >>>>>
> > >>>>> $ gcc -o test_mpi test_mpi.c -lm
> > >>>>>
> > >>>>> This is the runtime library path
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> > >>>>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> > >>>>>
> > >>>>> These commands are given in the exact sequence in which they were entered
> > >>>>> at the console.
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > >>>>> salloc: Pending job allocation 156
> > >>>>> salloc: job 156 queued and waiting for resources
> > >>>>> salloc: job 156 has been allocated resources
> > >>>>> salloc: Granted job allocation 156
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > >>>>> --------------------------------------------------------------------------
> > >>>>> An ORTE daemon has unexpectedly failed after launch and before
> > >>>>> communicating back to mpirun. This could be caused by a number
> > >>>>> of factors, including an inability to create a connection back
> > >>>>> to mpirun due to a lack of common network interfaces and/or no
> > >>>>> route found between them. Please check network connectivity
> > >>>>> (including firewalls and network routing requirements).
> > >>>>> --------------------------------------------------------------------------
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.425439
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.427427
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.422579
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.424168
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.423951
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.422414
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.427156
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.424834
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.425103
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.422415
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is: 5.422948
> > >>>>> Total time is: 59.668622
> > >>>>>
> > >>>>> Thanks,    -- bennet
> > >>>>>
> > >>>>> make check results
> > >>>>> ----------------------------------------------
> > >>>>>
> > >>>>> make  check-TESTS
> > >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> > >>>>> PASS: predefined_gap_test
> > >>>>> PASS: predefined_pad_test
> > >>>>> SKIP: dlopen_test
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 3
> > >>>>> # PASS:  2
> > >>>>> # SKIP:  1
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> PASS: atomic_cmpset_noinline
> > >>>>>  - 5 threads: Passed
> > >>>>> PASS: atomic_cmpset_noinline
> > >>>>>  - 8 threads: Passed
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 8
> > >>>>> # PASS:  8
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> > >>>>> PASS: ompi_rb_tree
> > >>>>> PASS: opal_bitmap
> > >>>>> PASS: opal_hash_table
> > >>>>> PASS: opal_proc_table
> > >>>>> PASS: opal_tree
> > >>>>> PASS: opal_list
> > >>>>> PASS: opal_value_array
> > >>>>> PASS: opal_pointer_array
> > >>>>> PASS: opal_lifo
> > >>>>> PASS: opal_fifo
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 10
> > >>>>> # PASS:  10
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make  opal_thread opal_condition
> > >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>>>   CC       opal_thread.o
> > >>>>>   CCLD     opal_thread
> > >>>>>   CC       opal_condition.o
> > >>>>>   CCLD     opal_condition
> > >>>>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>>> make  check-TESTS
> > >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 0
> > >>>>> # PASS:  0
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> > >>>>> PASS: opal_datatype_test
> > >>>>> PASS: unpack_hetero
> > >>>>> PASS: checksum
> > >>>>> PASS: position
> > >>>>> PASS: position_noncontig
> > >>>>> PASS: ddt_test
> > >>>>> PASS: ddt_raw
> > >>>>> PASS: unpack_ooo
> > >>>>> PASS: ddt_pack
> > >>>>> PASS: external32
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 10
> > >>>>> # PASS:  10
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> > >>>>> PASS: opal_bit_ops
> > >>>>> PASS: opal_path_nfs
> > >>>>> PASS: bipartite_graph
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 3
> > >>>>> # PASS:  3
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> > >>>>> PASS: dss_buffer
> > >>>>> PASS: dss_cmp
> > >>>>> PASS: dss_payload
> > >>>>> PASS: dss_print
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 4
> > >>>>> # PASS:  4
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
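
The wrapper-compiler point raised above can be checked directly: rebuild the test program with mpicc from the Open MPI install under test, then confirm which MPI library the binary actually resolves at run time. The commands below are only a sketch; the module name and install prefix are the ones that appear earlier in this thread, and the ldd filter is illustrative.

# Rebuild the test against the intended Open MPI instead of whatever bare gcc picks up.
module load gcc/7.1.0
export PATH=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/bin:$PATH

# The wrapper adds the right -I/-L/-l flags; compare its output with `mpicc --show`.
mpicc -o test_mpi test_mpi.c -lm     # instead of: gcc -o test_mpi test_mpi.c -lm

# Confirm which libmpi (and PMIx) the binary will load at run time,
# and that mpirun comes from the same install as mpicc.
ldd ./test_mpi | grep -i -e libmpi -e pmix
which mpicc mpirun
mpirun --version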
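On the launch side, the checks suggested in the thread can be reproduced interactively from inside a Slurm allocation, roughly as follows; the only name not taken from the thread is the log file plm_debug.log, which is arbitrary.

# Get an allocation (opens an interactive shell under the allocation).
salloc -N 1 --ntasks-per-node=24

# Direct launch through Slurm/PMIx -- reported working in this thread.
srun ./test_mpi

# Launch through mpirun's default launcher -- reported failing in this thread.
mpirun ./test_mpi

# Force the rsh/ssh launcher and turn up launcher verbosity to see
# where the remote orted stops.
mpirun --mca plm rsh --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee plm_debug.log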