If it's of any use, 3.0.0 seems to hang at

Making check in class
make[2]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make  ompi_rb_tree opal_bitmap opal_hash_table opal_proc_table
opal_tree opal_list opal_value_array opal_pointer_array opal_lifo
opal_fifo
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[3]: `ompi_rb_tree' is up to date.
make[3]: `opal_bitmap' is up to date.
make[3]: `opal_hash_table' is up to date.
make[3]: `opal_proc_table' is up to date.
make[3]: `opal_tree' is up to date.
make[3]: `opal_list' is up to date.
make[3]: `opal_value_array' is up to date.
make[3]: `opal_pointer_array' is up to date.
make[3]: `opal_lifo' is up to date.
make[3]: `opal_fifo' is up to date.
make[3]: Leaving directory `/tmp/build/openmpi-3.0.0/test/class'
make  check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[4]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'

I have to interrupt it; it has been running for many minutes, and these
tests have not behaved this way before.

-- bennet

On Mon, Jun 18, 2018 at 4:21 PM Bennet Fauber <ben...@umich.edu> wrote:
>
> No such luck.  If it matters, mpirun does seem to work with processes
> on the local node that have no internal MPI code.  That is,
>
> [bennet@cavium-hpc ~]$ mpirun -np 4 hello
> Hello, ARM
> Hello, ARM
> Hello, ARM
> Hello, ARM
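>
> ("hello" is a plain C program with no MPI calls; a minimal example of
> that kind of program, compiled with gcc alone, would be something like
>
>     /* minimal non-MPI example; the actual hello.c is not shown in this thread */
>     #include <stdio.h>
>
>     int main(void)
>     {
>         printf("Hello, ARM\n");   /* one line per launched process */
>         return 0;
>     }
>
> so mpirun only has to fork/exec it on the local node.)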
>
> but it fails with a similar error if run while a SLURM job is active; i.e.,
>
> [bennet@cavium-hpc ~]$ mpirun hello
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[26589,0],0] on node cavium-hpc
>   Remote daemon: [[26589,0],1] on node cav01
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
>
> That makes sense, I guess.
>
> I'll keep you posted as to what happens with 3.0.0 and downgrading SLURM.
>
>
> Thanks,   -- bennet
>
>
> On Mon, Jun 18, 2018 at 4:05 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >
> > I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to
> > your mpirun cmd line and see if that works.
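> >
> > A concrete invocation with the test binary from this thread would look
> > something like
> >
> >     mpirun --mca plm rsh ./test_mpi
> >
> > which forces the ssh/rsh launcher instead of the Slurm-based launcher,
> > so it should show whether the Slurm plm component is what is failing.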
> >
> >
> > > On Jun 18, 2018, at 12:57 PM, Bennet Fauber <ben...@umich.edu> wrote:
> > >
> > > To eliminate possibilities, I removed all other versions of OpenMPI
> > > from the system, and rebuilt using the same build script as was used
> > > to generate the prior report.
> > >
> > > [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> > > Checking compilers and things
> > > OMPI is ompi
> > > COMP_NAME is gcc_7_1_0
> > > SRC_ROOT is /sw/arcts/centos7/src
> > > PREFIX_ROOT is /sw/arcts/centos7
> > > PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > > CONFIGURE_FLAGS are
> > > COMPILERS are CC=gcc CXX=g++ FC=gfortran
> > >
> > > Currently Loaded Modules:
> > >  1) gcc/7.1.0
> > >
> > > gcc (ARM-build-14) 7.1.0
> > > Copyright (C) 2017 Free Software Foundation, Inc.
> > > This is free software; see the source for copying conditions.  There is NO
> > > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
> > > PURPOSE.
> > >
> > > Using the following configure command
> > >
> > > ./configure     --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> > >   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
> > > --with-pmix=/opt/pmix/2.0.2     --with-libevent=external
> > > --with-hwloc=external     --with-slurm     --disable-dlopen
> > > --enable-debug          CC=gcc CXX=g++ FC=gfortran
> > >
> > > The tarball (md5sum and path) is
> > >
> > > 2e783873f6b206aa71f745762fa15da5
> > > /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
> > >
> > > I still get
> > >
> > > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > > salloc: Pending job allocation 165
> > > salloc: job 165 queued and waiting for resources
> > > salloc: job 165 has been allocated resources
> > > salloc: Granted job allocation 165
> > > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > > The sum = 0.866386
> > > Elapsed time is:  5.425549
> > > The sum = 0.866386
> > > Elapsed time is:  5.422826
> > > The sum = 0.866386
> > > Elapsed time is:  5.427676
> > > The sum = 0.866386
> > > Elapsed time is:  5.424928
> > > The sum = 0.866386
> > > Elapsed time is:  5.422060
> > > The sum = 0.866386
> > > Elapsed time is:  5.425431
> > > The sum = 0.866386
> > > Elapsed time is:  5.424350
> > > The sum = 0.866386
> > > Elapsed time is:  5.423037
> > > The sum = 0.866386
> > > Elapsed time is:  5.427727
> > > The sum = 0.866386
> > > Elapsed time is:  5.424922
> > > The sum = 0.866386
> > > Elapsed time is:  5.424279
> > > Total time is:  59.672992
> > >
> > > [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > > --------------------------------------------------------------------------
> > > An ORTE daemon has unexpectedly failed after launch and before
> > > communicating back to mpirun. This could be caused by a number
> > > of factors, including an inability to create a connection back
> > > to mpirun due to a lack of common network interfaces and/or no
> > > route found between them. Please check network connectivity
> > > (including firewalls and network routing requirements).
> > > --------------------------------------------------------------------------
> > >
> > > I reran with
> > >
> > > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> > > 2>&1 | tee debug3.log
> > >
> > > and the gzipped log is attached.
> > >
> > > I thought to try it with a different test program, which spits out this error:
> > > [cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
> > > found in file base/ess_base_std_app.c at line 219
> > > [cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
> > > found in file base/ess_base_std_app.c at line 219
> > > --------------------------------------------------------------------------
> > > It looks like orte_init failed for some reason; your parallel process is
> > > likely to abort.  There are many reasons that a parallel process can
> > > fail during orte_init; some of which are due to configuration or
> > > environment problems.  This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > >  store DAEMON URI failed
> > >  --> Returned value Not found (-13) instead of ORTE_SUCCESS
> > >
> > >
> > > At one point, I am almost certain that OMPI mpirun did work, and I am
> > > at a loss to explain why it no longer does.
> > >
> > > I have also tried the 3.1.1rc1 version.  I am now going to try 3.0.0,
> > > and we'll try downgrading SLURM to a prior version.
> > >
> > > -- bennet
> > >
> > >
> > > On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org
> > > <r...@open-mpi.org> wrote:
> > >>
> > >> Hmmm...well, the error has changed from your initial report. Turning off 
> > >> the firewall was the solution to that problem.
> > >>
> > >> This problem is different - it isn’t the orted that failed in the log 
> > >> you sent, but the application proc that couldn’t initialize. It looks 
> > >> like that app was compiled against some earlier version of OMPI? It is 
> > >> looking for something that no longer exists. I saw that you compiled it 
> > >> with a simple “gcc” instead of our wrapper compiler “mpicc” - any 
> > >> particular reason? My guess is that your compile picked up some older 
> > >> version of OMPI on the system.
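> > >>
> > >> For comparison, rebuilding the test with the wrapper compiler would be
> > >> something like
> > >>
> > >>     mpicc -o test_mpi test_mpi.c -lm
> > >>
> > >> so that the include and library paths for this 3.1.0 install are picked
> > >> up automatically instead of whatever gcc finds on its own.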
> > >>
> > >> Ralph
> > >>
> > >>
> > >>> On Jun 17, 2018, at 2:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> > >>>
> > >>> I rebuilt with --enable-debug, then ran with
> > >>>
> > >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > >>> salloc: Pending job allocation 158
> > >>> salloc: job 158 queued and waiting for resources
> > >>> salloc: job 158 has been allocated resources
> > >>> salloc: Granted job allocation 158
> > >>>
> > >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.426759
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.424068
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.426195
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.426059
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.423192
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.426252
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.425444
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.423647
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.426082
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.425936
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.423964
> > >>> Total time is:  59.677830
> > >>>
> > >>> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> > >>> 2>&1 | tee debug2.log
> > >>>
> > >>> The zipped debug log should be attached.
> > >>>
> > >>> I did that after using systemctl to turn off the firewall on the login
> > >>> node from which the mpirun is executed, as well as on the host on
> > >>> which it runs.
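> > >>>
> > >>> (That is, on each of the two hosts, something like
> > >>>
> > >>>     sudo systemctl stop firewalld
> > >>>
> > >>> assuming the default firewalld service on these CentOS 7 machines.)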
> > >>>
> > >>> [bennet@cavium-hpc ~]$ mpirun hostname
> > >>> --------------------------------------------------------------------------
> > >>> An ORTE daemon has unexpectedly failed after launch and before
> > >>> communicating back to mpirun. This could be caused by a number
> > >>> of factors, including an inability to create a connection back
> > >>> to mpirun due to a lack of common network interfaces and/or no
> > >>> route found between them. Please check network connectivity
> > >>> (including firewalls and network routing requirements).
> > >>> --------------------------------------------------------------------------
> > >>>
> > >>> [bennet@cavium-hpc ~]$ squeue
> > >>>            JOBID PARTITION     NAME     USER ST       TIME  NODES
> > >>> NODELIST(REASON)
> > >>>              158  standard     bash   bennet  R      14:30      1 cav01
> > >>> [bennet@cavium-hpc ~]$ srun hostname
> > >>> cav01.arc-ts.umich.edu
> > >>> [ repeated 23 more times ]
> > >>>
> > >>> As always, your help is much appreciated,
> > >>>
> > >>> -- bennet
> > >>>
> > >>> On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> 
> > >>> wrote:
> > >>>>
> > >>>> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
> > >>>> plm_base_verbose 10 to your mpirun cmd line. For some reason, the 
> > >>>> remote daemon isn’t starting - this will give you some info as to why.
> > >>>>
> > >>>>
> > >>>>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> > >>>>>
> > >>>>> I have a compiled binary that will run with srun but not with mpirun.
> > >>>>> The attempts to run with mpirun all result in failures to initialize.
> > >>>>> I have tried this on one node, and on two nodes, with the firewall
> > >>>>> turned on and with it off.
> > >>>>>
> > >>>>> Am I missing some command line option for mpirun?
> > >>>>>
> > >>>>> OMPI built from this configure command
> > >>>>>
> > >>>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> > >>>>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> > >>>>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> > >>>>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> > >>>>> FC=gfortran
> > >>>>>
> > >>>>> All tests from `make check` passed, see below.
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ mpicc --show
> > >>>>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> > >>>>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> > >>>>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> > >>>>> -Wl,--enable-new-dtags
> > >>>>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> > >>>>>
> > >>>>> The test_mpi was compiled with
> > >>>>>
> > >>>>> $ gcc -o test_mpi test_mpi.c -lm
> > >>>>>
> > >>>>> This is the runtime library path
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> > >>>>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> > >>>>>
> > >>>>>
> > >>>>> These commands are given in the exact sequence in which they were
> > >>>>> entered at a console.
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > >>>>> salloc: Pending job allocation 156
> > >>>>> salloc: job 156 queued and waiting for resources
> > >>>>> salloc: job 156 has been allocated resources
> > >>>>> salloc: Granted job allocation 156
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > >>>>> --------------------------------------------------------------------------
> > >>>>> An ORTE daemon has unexpectedly failed after launch and before
> > >>>>> communicating back to mpirun. This could be caused by a number
> > >>>>> of factors, including an inability to create a connection back
> > >>>>> to mpirun due to a lack of common network interfaces and/or no
> > >>>>> route found between them. Please check network connectivity
> > >>>>> (including firewalls and network routing requirements).
> > >>>>> --------------------------------------------------------------------------
> > >>>>>
> > >>>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.425439
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.427427
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.422579
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.424168
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.423951
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.422414
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.427156
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.424834
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.425103
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.422415
> > >>>>> The sum = 0.866386
> > >>>>> Elapsed time is:  5.422948
> > >>>>> Total time is:  59.668622
> > >>>>>
> > >>>>> Thanks,    -- bennet
> > >>>>>
> > >>>>>
> > >>>>> make check results
> > >>>>> ----------------------------------------------
> > >>>>>
> > >>>>> make  check-TESTS
> > >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> > >>>>> PASS: predefined_gap_test
> > >>>>> PASS: predefined_pad_test
> > >>>>> SKIP: dlopen_test
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 3
> > >>>>> # PASS:  2
> > >>>>> # SKIP:  1
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> PASS: atomic_cmpset_noinline
> > >>>>>  - 5 threads: Passed
> > >>>>> PASS: atomic_cmpset_noinline
> > >>>>>  - 8 threads: Passed
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 8
> > >>>>> # PASS:  8
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> > >>>>> PASS: ompi_rb_tree
> > >>>>> PASS: opal_bitmap
> > >>>>> PASS: opal_hash_table
> > >>>>> PASS: opal_proc_table
> > >>>>> PASS: opal_tree
> > >>>>> PASS: opal_list
> > >>>>> PASS: opal_value_array
> > >>>>> PASS: opal_pointer_array
> > >>>>> PASS: opal_lifo
> > >>>>> PASS: opal_fifo
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 10
> > >>>>> # PASS:  10
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make  opal_thread opal_condition
> > >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>>> CC       opal_thread.o
> > >>>>> CCLD     opal_thread
> > >>>>> CC       opal_condition.o
> > >>>>> CCLD     opal_condition
> > >>>>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>>> make  check-TESTS
> > >>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 0
> > >>>>> # PASS:  0
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> > >>>>> PASS: opal_datatype_test
> > >>>>> PASS: unpack_hetero
> > >>>>> PASS: checksum
> > >>>>> PASS: position
> > >>>>> PASS: position_noncontig
> > >>>>> PASS: ddt_test
> > >>>>> PASS: ddt_raw
> > >>>>> PASS: unpack_ooo
> > >>>>> PASS: ddt_pack
> > >>>>> PASS: external32
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 10
> > >>>>> # PASS:  10
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> > >>>>> PASS: opal_bit_ops
> > >>>>> PASS: opal_path_nfs
> > >>>>> PASS: bipartite_graph
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 3
> > >>>>> # PASS:  3
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>> [ elided ]
> > >>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> > >>>>> PASS: dss_buffer
> > >>>>> PASS: dss_cmp
> > >>>>> PASS: dss_payload
> > >>>>> PASS: dss_print
> > >>>>> ============================================================================
> > >>>>> Testsuite summary for Open MPI 3.1.0
> > >>>>> ============================================================================
> > >>>>> # TOTAL: 4
> > >>>>> # PASS:  4
> > >>>>> # SKIP:  0
> > >>>>> # XFAIL: 0
> > >>>>> # FAIL:  0
> > >>>>> # XPASS: 0
> > >>>>> # ERROR: 0
> > >>>>> ============================================================================
> > >>>>
> > >>> <debug2.log.gz>
> > >>
> > > <debug3.log.gz>
> >