Hi Matt,

There seem to be two different issues here:
a) The warning message comes from the openib btl. Given that Omnipath has a verbs API and you have the necessary libraries on your system, the openib btl finds itself as a potential transport and prints the warning during its init (the openib btl is on its way to deprecation). You may try to explicitly ask for the vader btl, since you are running on shared memory: -mca btl self,vader -mca pml ob1. Or, better, explicitly build without openib: ./configure --with-verbs=no …

b) Not my field of expertise, but you may have a conflict with the external components you are using: --with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr. You may try not specifying these and using the ones provided by OMPI.

_MAC
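For example, a minimal sketch of both suggestions (it reuses your test binary and the other configure flags you already pass; the build/install steps and paths are just placeholders for your own procedure):

  # (a) run over shared memory only, so the openib btl is never selected
  mpirun -np 4 -mca pml ob1 -mca btl self,vader ./helloWorld.mpi3.SLES12.OMPI400.exe

  # (a)+(b) rebuild without openib and with the bundled PMIx/libevent
  # (i.e. simply omit --with-pmix and --with-libevent)
  ./configure --with-verbs=no --with-psm2 --with-slurm --enable-mpi1-compatibility \
      CC=icc CXX=icpc FC=ifort
  make -j 8 && make install

Note that the first command only makes sense for a single-node, shared-memory run; across nodes on an Omnipath cluster you would normally let the PSM2 support built via --with-psm2 carry the traffic.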
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Tuesday, January 22, 2019 6:04 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

Well, by turning off UCX compilation per Howard, things get a bit better in that something happens! It's not a good something, as it seems to die with an infiniband error. As this is an Omnipath system, is Open MPI perhaps seeing libverbs somewhere and compiling it in? To wit:

(1006)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter to true.

  Local host:    borgc129
  Local adapter: hfi1_0
  Local port:    1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   borgc129
  Local device: hfi1_0
--------------------------------------------------------------------------
Compiler Version: Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.5.274 Build 20180823
MPI Version: 3.1
MPI Library Version: Open MPI v4.0.0, package: Open MPI mathomp4@discover23 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018
[borgc129:260830] *** An error occurred in MPI_Barrier
[borgc129:260830] *** reported by process [140736833716225,46909632806913]
[borgc129:260830] *** on communicator MPI_COMM_WORLD
[borgc129:260830] *** MPI_ERR_OTHER: known error not in list
[borgc129:260830] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[borgc129:260830] ***    and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line     Source
helloWorld.mpi3.S  000000000040A38E  for__signal_handl  Unknown  Unknown
libpthread-2.22.s  00002AAAAB9CCB20  Unknown            Unknown  Unknown
libpthread-2.22.s  00002AAAAB9C90CD  pthread_cond_wait  Unknown  Unknown
libpmix.so.2.1.11  00002AAAB1D780A1  PMIx_Abort         Unknown  Unknown
mca_pmix_ext2x.so  00002AAAB1B3AA75  ext2x_abort        Unknown  Unknown
mca_ess_pmi.so     00002AAAB1724BC0  Unknown            Unknown  Unknown
libopen-rte.so.40  00002AAAAC3E941C  orte_errmgr_base_  Unknown  Unknown
mca_errmgr_defaul  00002AAABC401668  Unknown            Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3CDBC4  ompi_mpi_abort     Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3BB1EF  ompi_mpi_errors_a  Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3B99C9  ompi_errhandler_i  Unknown  Unknown
libmpi.so.40.20.0  00002AAAAB3E4576  MPI_Barrier        Unknown  Unknown
libmpi_mpifh.so.4  00002AAAAB15EE53  MPI_Barrier_f08    Unknown  Unknown
libmpi_usempif08.  00002AAAAACE7732  mpi_barrier_f08_   Unknown  Unknown
helloWorld.mpi3.S  000000000040939F  Unknown            Unknown  Unknown
helloWorld.mpi3.S  000000000040915E  Unknown            Unknown  Unknown
libc-2.22.so       00002AAAABBF96D5  __libc_start_main  Unknown  Unknown
helloWorld.mpi3.S  0000000000409069  Unknown            Unknown  Unknown

On Sun, Jan 20, 2019 at 4:19 PM Howard Pritchard <hpprit...@gmail.com> wrote:

Hi Matt,

Definitely do not include the UCX option for an Omnipath cluster. Actually, if UCX happens to be installed in its default location on the system, switch to this config option: --with-ucx=no. Otherwise you will hit https://github.com/openucx/ucx/issues/750

Howard
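(A quick way to see which transports actually made it into a given build is ompi_info; a sketch, assuming the ompi_info from the build in question is first in your PATH:)

  # list the transport components that were compiled in
  ompi_info | grep -E "btl|pml|mtl"
  # if no "openib" or "ucx" lines appear, those paths cannot be selected at run time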
Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote on Sat, Jan 19, 2019 at 18:41:

Matt,

There are two ways of using PMIx:

- if you use mpirun, then the MPI app (e.g. the PMIx client) will talk to the mpirun and orted daemons (e.g. the PMIx server)
- if you use SLURM srun, then the MPI app will directly talk to the PMIx server provided by SLURM (note you might have to srun --mpi=pmix_v2 or something)

In the former case, it does not matter whether you use the embedded or external PMIx. In the latter case, Open MPI and SLURM have to use compatible PMIx libraries; you can either check the cross-version compatibility matrix, or build Open MPI with the same PMIx used by SLURM to be on the safe side (not a bad idea IMHO).

Regarding the hang, I suggest you try a few different things:

- use mpirun in a SLURM job (e.g. sbatch instead of salloc, so mpirun runs on a compute node rather than on a frontend node)
- try something even simpler, such as mpirun hostname (both with sbatch and salloc)
- explicitly specify the network to be used for the wire-up, for example mpirun --mca oob_tcp_if_include 192.168.0.0/24, if this is the subnet over which all the nodes (e.g. compute nodes, and the frontend node if you use salloc) communicate

Cheers,

Gilles
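Concretely, those checks might look something like the following sketch (the node counts, time limit, and subnet are placeholders for whatever your cluster actually uses):

  # 1) launch from inside a batch job instead of from the salloc front-end
  $ cat test_ompi.sbatch
  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=4
  #SBATCH --time=00:05:00
  mpirun hostname
  mpirun ./helloWorld.mpi3.SLES12.OMPI400.exe
  $ sbatch test_ompi.sbatch

  # 2) or let SLURM's own PMIx server do the wire-up, bypassing mpirun/orted
  #    (requires Open MPI built against a PMIx compatible with SLURM's)
  $ srun --mpi=pmix_v2 -n 8 ./helloWorld.mpi3.SLES12.OMPI400.exe

  # 3) pin the out-of-band wire-up to a subnet every node can reach
  $ mpirun --mca oob_tcp_if_include 192.168.0.0/24 hostname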
On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson <fort...@gmail.com> wrote:
>
> On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <users@lists.open-mpi.org> wrote:
>>
>> On Jan 18, 2019, at 12:43 PM, Matt Thompson <fort...@gmail.com> wrote:
>> >
>> > With some help, I managed to build an Open MPI 4.0.0 with:
>>
>> We can discuss each of these params to let you know what they are.
>>
>> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>>
>> Did you have a reason for disabling these? They're generally good things. What they do is add linker flags to the wrapper compilers (i.e., mpicc and friends) that basically put a default path to find libraries at run time (that can/will in most cases override LD_LIBRARY_PATH -- but you can override these linked-in default paths if you want/need to).
>
> I've had these in my Open MPI builds for a while now. The reason was that one of the libraries I need for the climate model I work on went nuts if both of them weren't there. It was originally the rpath one, but then eventually (Open MPI 3?) I had to add the runpath one. But I have been updating the libraries more aggressively recently (due to OS upgrades), so it's possible this is no longer needed.
>
>> > --with-psm2
>>
>> Ensure that Open MPI can include support for the PSM2 library, and abort configure if it cannot.
>>
>> > --with-slurm
>>
>> Ensure that Open MPI can include support for SLURM, and abort configure if it cannot.
>>
>> > --enable-mpi1-compatibility
>>
>> Add support for MPI_Address and other MPI-1 functions that have since been deleted from the MPI 3.x specification.
>>
>> > --with-ucx
>>
>> Ensure that Open MPI can include support for UCX, and abort configure if it cannot.
>>
>> > --with-pmix=/usr/nlocal/pmix/2.1
>>
>> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally to Open MPI's source code tree/expanded tarball).
>>
>> Unless you have a reason to use the external PMIx, the internal/bundled PMIx is usually sufficient.
>
> Ah. I did not know that. I figured that if our SLURM was built linked to a specific PMIx v2, I should build Open MPI with the same PMIx. I'll build an Open MPI 4 without specifying this.
>
>> > --with-libevent=/usr
>>
>> Same as previous; change "pmix" to "libevent" (i.e., use the external libevent instead of the bundled libevent).
>>
>> > CC=icc CXX=icpc FC=ifort
>>
>> Specify the exact compilers to use.
>>
>> > The MPI 1 is because I need to build HDF5 eventually, and I added psm2 because it's an Omnipath cluster. The libevent was probably a red herring, as libevent-devel wasn't installed on the system. It was eventually, and I just didn't remove the flag. And I saw no errors in the build!
>>
>> Might as well remove the --with-libevent if you don't need it.
>>
>> > However, I seem to have built an Open MPI that doesn't work:
>> >
>> > (1099)(master) $ mpirun --version
>> > mpirun (Open MPI) 4.0.0
>> >
>> > Report bugs to http://www.open-mpi.org/community/help/
>> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>> >
>> > It just sits there...forever. Can the gurus here help me figure out what I managed to break? Perhaps I added too much to my configure line? Not enough?
>>
>> There could be a few things going on here.
>>
>> Are you running inside a SLURM job? E.g., in a "salloc" job, or in an "sbatch" script?
>
> I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a few months ago), and it had some interesting startup scaling I liked (slow at low core count, but getting close to Intel MPI at high core count), though it seemed to not work after about 100 nodes (4000 processes) or so.
>
> --
> Matt Thompson
> "The fact is, this is about us identifying what we do best and finding more ways of doing less of it better" -- Director of Better Anna Rampton

--
Matt Thompson
"The fact is, this is about us identifying what we do best and finding more ways of doing less of it better" -- Director of Better Anna Rampton
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users