No such luck.  If it matters, mpirun does seem to work with processes
on the local node that make no MPI calls.  That is,

[bennet@cavium-hpc ~]$ mpirun -np 4 hello
Hello, ARM
Hello, ARM
Hello, ARM
Hello, ARM

but it fails with a similar error if run while a SLURM job is active; i.e.,

[bennet@cavium-hpc ~]$ mpirun hello
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[26589,0],0] on node cavium-hpc
  Remote daemon: [[26589,0],1] on node cav01

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

That makes sense, I guess.
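
For what it's worth, hello is just a plain serial program with no MPI calls in
it; something along these lines (not necessarily the exact source):

/* hello.c -- serial stand-in, built with plain gcc; no MPI involved */
#include <stdio.h>

int main(void)
{
    printf("Hello, ARM\n");
    return 0;
}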

I'll keep you posted as to what happens with Open MPI 3.0.0 and with downgrading SLURM.


Thanks,   -- bennet


On Mon, Jun 18, 2018 at 4:05 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
>
> I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to 
> your mpirun cmd line and see if that works.
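>
> For example, with the test binary from the earlier messages that would be
> something like:
>
>   mpirun --mca plm rsh ./test_mpi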
>
>
> > On Jun 18, 2018, at 12:57 PM, Bennet Fauber <ben...@umich.edu> wrote:
> >
> > To eliminate possibilities, I removed all other versions of OpenMPI
> > from the system, and rebuilt using the same build script as was used
> > to generate the prior report.
> >
> > [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> > Checking compilers and things
> > OMPI is ompi
> > COMP_NAME is gcc_7_1_0
> > SRC_ROOT is /sw/arcts/centos7/src
> > PREFIX_ROOT is /sw/arcts/centos7
> > PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > CONFIGURE_FLAGS are
> > COMPILERS are CC=gcc CXX=g++ FC=gfortran
> >
> > Currently Loaded Modules:
> >  1) gcc/7.1.0
> >
> > gcc (ARM-build-14) 7.1.0
> > Copyright (C) 2017 Free Software Foundation, Inc.
> > This is free software; see the source for copying conditions.  There is NO
> > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> >
> > Using the following configure command
> >
> > ./configure     --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> >   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
> > --with-pmix=/opt/pmix/2.0.2     --with-libevent=external
> > --with-hwloc=external     --with-slurm     --disable-dlopen
> > --enable-debug          CC=gcc CXX=g++ FC=gfortran
> >
> > The tar ball is
> >
> > 2e783873f6b206aa71f745762fa15da5
> > /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
> >
> > I still get
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 165
> > salloc: job 165 queued and waiting for resources
> > salloc: job 165 has been allocated resources
> > salloc: Granted job allocation 165
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is:  5.425549
> > The sum = 0.866386
> > Elapsed time is:  5.422826
> > The sum = 0.866386
> > Elapsed time is:  5.427676
> > The sum = 0.866386
> > Elapsed time is:  5.424928
> > The sum = 0.866386
> > Elapsed time is:  5.422060
> > The sum = 0.866386
> > Elapsed time is:  5.425431
> > The sum = 0.866386
> > Elapsed time is:  5.424350
> > The sum = 0.866386
> > Elapsed time is:  5.423037
> > The sum = 0.866386
> > Elapsed time is:  5.427727
> > The sum = 0.866386
> > Elapsed time is:  5.424922
> > The sum = 0.866386
> > Elapsed time is:  5.424279
> > Total time is:  59.672992
> >
> > [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > I reran with
> >
> > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> > 2>&1 | tee debug3.log
> >
> > and the gzipped log is attached.
> >
> > I tried it with a different test program, which emits this error:
> > [cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
> > found in file base/ess_base_std_app.c at line 219
> > [cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
> > found in file base/ess_base_std_app.c at line 219
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems.  This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> >  store DAEMON URI failed
> >  --> Returned value Not found (-13) instead of ORTE_SUCCESS
> >
> >
> > I am almost certain that, at one point, OMPI's mpirun did work, and I am
> > at a loss to explain why it no longer does.
> >
> > I have also tried the 3.1.1rc1 version.  I am now going to try 3.0.0,
> > and we'll try downgrading SLURM to a prior version.
> >
> > -- bennet
> >
> >
> > On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >>
> >> Hmmm...well, the error has changed from your initial report. Turning off 
> >> the firewall was the solution to that problem.
> >>
> >> This problem is different - it isn’t the orted that failed in the log you 
> >> sent, but the application proc that couldn’t initialize. It looks like 
> >> that app was compiled against some earlier version of OMPI? It is looking 
> >> for something that no longer exists. I saw that you compiled it with a 
> >> simple “gcc” instead of our wrapper compiler “mpicc” - any particular 
> >> reason? My guess is that your compile picked up some older version of OMPI 
> >> on the system.
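> >>
> >> If the app is meant to call MPI, rebuilding it against this install would
> >> look something like (same source file name as in your note):
> >>
> >>   mpicc -o test_mpi test_mpi.c -lm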
> >>
> >> Ralph
> >>
> >>
> >>> On Jun 17, 2018, at 2:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> >>>
> >>> I rebuilt with --enable-debug, then ran with
> >>>
> >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> >>> salloc: Pending job allocation 158
> >>> salloc: job 158 queued and waiting for resources
> >>> salloc: job 158 has been allocated resources
> >>> salloc: Granted job allocation 158
> >>>
> >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.426759
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.424068
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.426195
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.426059
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.423192
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.426252
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.425444
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.423647
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.426082
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.425936
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.423964
> >>> Total time is:  59.677830
> >>>
> >>> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> >>> 2>&1 | tee debug2.log
> >>>
> >>> The zipped debug log should be attached.
> >>>
> >>> I did that after using systemctl to turn off the firewall on the login
> >>> node from which mpirun is executed, as well as on the compute host on
> >>> which the job runs.
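> >>>
> >>> (That is, something along the lines of "systemctl stop firewalld" on both
> >>> machines, assuming firewalld is the active firewall on these CentOS 7 hosts.)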
> >>>
> >>> [bennet@cavium-hpc ~]$ mpirun hostname
> >>> --------------------------------------------------------------------------
> >>> An ORTE daemon has unexpectedly failed after launch and before
> >>> communicating back to mpirun. This could be caused by a number
> >>> of factors, including an inability to create a connection back
> >>> to mpirun due to a lack of common network interfaces and/or no
> >>> route found between them. Please check network connectivity
> >>> (including firewalls and network routing requirements).
> >>> --------------------------------------------------------------------------
> >>>
> >>> [bennet@cavium-hpc ~]$ squeue
> >>>            JOBID PARTITION     NAME     USER ST       TIME  NODES
> >>> NODELIST(REASON)
> >>>              158  standard     bash   bennet  R      14:30      1 cav01
> >>> [bennet@cavium-hpc ~]$ srun hostname
> >>> cav01.arc-ts.umich.edu
> >>> [ repeated 23 more times ]
> >>>
> >>> As always, your help is much appreciated,
> >>>
> >>> -- bennet
> >>>
> >>> On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> 
> >>> wrote:
> >>>>
> >>>> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
> >>>> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
> >>>> daemon isn’t starting - this will give you some info as to why.
> >>>>
> >>>>
> >>>>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >>>>>
> >>>>> I have a compiled binary that will run with srun but not with mpirun.
> >>>>> The attempts to run with mpirun all result in failures to initialize.
> >>>>> I have tried this on one node and on two nodes, with the firewall turned
> >>>>> on and with it off.
> >>>>>
> >>>>> Am I missing some command line option for mpirun?
> >>>>>
> >>>>> OMPI built from this configure command
> >>>>>
> >>>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> >>>>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> >>>>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >>>>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> >>>>> FC=gfortran
> >>>>>
> >>>>> All tests from `make check` passed; see below.
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ mpicc --show
> >>>>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> >>>>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> >>>>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> >>>>> -Wl,--enable-new-dtags
> >>>>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> >>>>>
> >>>>> The test_mpi program was compiled with
> >>>>>
> >>>>> $ gcc -o test_mpi test_mpi.c -lm
> >>>>>
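> >>>>> test_mpi itself is essentially a repeated, timed summation, roughly along
> >>>>> these lines (a simplified sketch rather than the exact source):
> >>>>>
> >>>>> /* test_mpi.c -- simplified sketch; the real source may differ in details */
> >>>>> #include <math.h>
> >>>>> #include <stdio.h>
> >>>>> #include <time.h>
> >>>>>
> >>>>> int main(void)
> >>>>> {
> >>>>>     const long N = 100000000;        /* size of each summation */
> >>>>>     const int  reps = 11;            /* number of timed repetitions */
> >>>>>     double total = 0.0;
> >>>>>
> >>>>>     for (int r = 0; r < reps; r++) {
> >>>>>         struct timespec a, b;
> >>>>>         double sum = 0.0;
> >>>>>
> >>>>>         clock_gettime(CLOCK_MONOTONIC, &a);
> >>>>>         for (long i = 1; i <= N; i++)
> >>>>>             sum += sin((double)i) / (double)N;   /* arbitrary FP work */
> >>>>>         clock_gettime(CLOCK_MONOTONIC, &b);
> >>>>>
> >>>>>         double dt = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
> >>>>>         printf("The sum = %f\n", sum);
> >>>>>         printf("Elapsed time is:  %f\n", dt);
> >>>>>         total += dt;
> >>>>>     }
> >>>>>     printf("Total time is:  %f\n", total);
> >>>>>     return 0;
> >>>>> }
> >>>>>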
> >>>>> This is the runtime library path
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> >>>>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> >>>>>
> >>>>>
> >>>>> These commands are given in the exact sequence in which they were entered
> >>>>> at a console.
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> >>>>> salloc: Pending job allocation 156
> >>>>> salloc: job 156 queued and waiting for resources
> >>>>> salloc: job 156 has been allocated resources
> >>>>> salloc: Granted job allocation 156
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> >>>>> --------------------------------------------------------------------------
> >>>>> An ORTE daemon has unexpectedly failed after launch and before
> >>>>> communicating back to mpirun. This could be caused by a number
> >>>>> of factors, including an inability to create a connection back
> >>>>> to mpirun due to a lack of common network interfaces and/or no
> >>>>> route found between them. Please check network connectivity
> >>>>> (including firewalls and network routing requirements).
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.425439
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.427427
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.422579
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.424168
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.423951
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.422414
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.427156
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.424834
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.425103
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.422415
> >>>>> The sum = 0.866386
> >>>>> Elapsed time is:  5.422948
> >>>>> Total time is:  59.668622
> >>>>>
> >>>>> Thanks,    -- bennet
> >>>>>
> >>>>>
> >>>>> make check results
> >>>>> ----------------------------------------------
> >>>>>
> >>>>> make  check-TESTS
> >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>>>> PASS: predefined_gap_test
> >>>>> PASS: predefined_pad_test
> >>>>> SKIP: dlopen_test
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 3
> >>>>> # PASS:  2
> >>>>> # SKIP:  1
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> PASS: atomic_cmpset_noinline
> >>>>>  - 5 threads: Passed
> >>>>> PASS: atomic_cmpset_noinline
> >>>>>  - 8 threads: Passed
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 8
> >>>>> # PASS:  8
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> >>>>> PASS: ompi_rb_tree
> >>>>> PASS: opal_bitmap
> >>>>> PASS: opal_hash_table
> >>>>> PASS: opal_proc_table
> >>>>> PASS: opal_tree
> >>>>> PASS: opal_list
> >>>>> PASS: opal_value_array
> >>>>> PASS: opal_pointer_array
> >>>>> PASS: opal_lifo
> >>>>> PASS: opal_fifo
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 10
> >>>>> # PASS:  10
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make  opal_thread opal_condition
> >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>>> CC       opal_thread.o
> >>>>> CCLD     opal_thread
> >>>>> CC       opal_condition.o
> >>>>> CCLD     opal_condition
> >>>>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>>> make  check-TESTS
> >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 0
> >>>>> # PASS:  0
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> >>>>> PASS: opal_datatype_test
> >>>>> PASS: unpack_hetero
> >>>>> PASS: checksum
> >>>>> PASS: position
> >>>>> PASS: position_noncontig
> >>>>> PASS: ddt_test
> >>>>> PASS: ddt_raw
> >>>>> PASS: unpack_ooo
> >>>>> PASS: ddt_pack
> >>>>> PASS: external32
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 10
> >>>>> # PASS:  10
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> >>>>> PASS: opal_bit_ops
> >>>>> PASS: opal_path_nfs
> >>>>> PASS: bipartite_graph
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 3
> >>>>> # PASS:  3
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>>>> [ elided ]
> >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> >>>>> PASS: dss_buffer
> >>>>> PASS: dss_cmp
> >>>>> PASS: dss_payload
> >>>>> PASS: dss_print
> >>>>> ============================================================================
> >>>>> Testsuite summary for Open MPI 3.1.0
> >>>>> ============================================================================
> >>>>> # TOTAL: 4
> >>>>> # PASS:  4
> >>>>> # SKIP:  0
> >>>>> # XFAIL: 0
> >>>>> # FAIL:  0
> >>>>> # XPASS: 0
> >>>>> # ERROR: 0
> >>>>> ============================================================================
> >>> <debug2.log.gz>
> > <debug3.log.gz>