On further investigation, removing the "preconnect_all" option does at least change the problem.
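For context, here is roughly what that change amounts to on the launch line. The real command lives in the attached test1.pbs, so the process count and executable name below are placeholders, and I'm assuming the option was being turned on through the mpi_preconnect_all MCA parameter:

  # Sketch only -- the actual launch line is in test1.pbs (attached earlier).
  # Old launch, with the preconnect option enabled via its MCA parameter
  # (assuming it was set as "mpi_preconnect_all"):
  #   mpirun --mca mpi_preconnect_all 1 --map-by node -np 16 ./test.exe
  #
  # New launch for the test below: the same command with that parameter
  # dropped, but "--map-by node" kept:
  mpirun --map-by node -np 16 ./test.exe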
Without "preconnect_all" I no longer see:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[32179,2],15]) is on host: node092
  Process 2 ([[32179,2],0]) is on host: unknown!
  BTLs attempted: self tcp vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

Instead it hangs for several minutes and finally aborts with:

--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------
[node091:19470] *** An error occurred in MPI_Comm_spawn
[node091:19470] *** reported by process [1614086145,0]
[node091:19470] *** on communicator MPI_COMM_WORLD
[node091:19470] *** MPI_ERR_UNKNOWN: unknown error
[node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node091:19470] ***    and potentially your MPI job)

I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout as suggested in the error message, but the result is unchanged (it just takes longer to time out). Once again, if I remove "--map-by node" it runs successfully.
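For completeness, this is roughly how I raised those timeouts. Treat it as a sketch: the parameter names are the ones quoted in the error message, the values are just examples of what I tried, and I passed them as MCA parameters on the mpirun line (process count and executable are placeholders again):

  # Sketch of what I tried -- parameter names come from the error message
  # above; the values are examples only, not recommendations.
  mpirun --map-by node \
         --mca pmix_server_max_wait 600 \
         --mca pmix_base_exchange_timeout 600 \
         -np 16 ./test.exe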
-Andrew

On Sunday, September 16, 2018 7:03:15 AM PDT Ralph H Castain wrote:
> I see you are using “preconnect_all” - that is the source of the trouble. I
> don’t believe we have tested that option in years and the code is almost
> certainly dead. I’d suggest removing that option and things should work.
>
> > On Sep 15, 2018, at 1:46 PM, Andrew Benson <abenso...@gmail.com> wrote:
> >
> > I'm running into problems trying to spawn MPI processes across multiple
> > nodes on a cluster using recent versions of OpenMPI. Specifically, using
> > the attached Fortran code, compiled using OpenMPI 3.1.2 with:
> >
> >   mpif90 test.F90 -o test.exe
> >
> > and run via a PBS scheduler using the attached test1.pbs, it fails as can
> > be seen in the attached testFAIL.err file.
> >
> > If I do the same but using OpenMPI v1.10.3 then it works successfully,
> > giving me the output in the attached testSUCCESS.err file.
> >
> > From testing a few different versions of OpenMPI it seems that the
> > behavior changed between v1.10.7 and v2.0.4.
> >
> > Is there some change in options needed to make this work with newer
> > OpenMPIs?
> >
> > Output from ompi_info --all is attached. config.log can be found here:
> >
> > http://users.obs.carnegiescience.edu/abenson/config.log.bz2
> >
> > Thanks for any help you can offer!
> >
> > -Andrew
> > <ompi_info.log.bz2><test.F90><test1.pbs><testFAIL.err.bz2><testSUCCESS.err.bz2>

--
* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
* Galacticus: https://bitbucket.org/abensonca/galacticus