On further investigation, removing the "preconnect_all" option does at least change 
the problem. Without "preconnect_all" I no longer see:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[32179,2],15]) is on host: node092
  Process 2 ([[32179,2],0]) is on host: unknown!
  BTLs attempted: self tcp vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------


Instead it hangs for several minutes and finally aborts with:

--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------
[node091:19470] *** An error occurred in MPI_Comm_spawn
[node091:19470] *** reported by process [1614086145,0]
[node091:19470] *** on communicator MPI_COMM_WORLD
[node091:19470] *** MPI_ERR_UNKNOWN: unknown error
[node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node091:19470] ***    and potentially your MPI job)

I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout 
as suggested in the error message, but the result is unchanged (it just takes 
longer to time out).
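
(For what it's worth, I adjusted them via the standard MCA mechanisms, assuming these 
are picked up like any other MCA parameter, i.e. something along the lines of

mpirun --mca pmix_server_max_wait 600 --mca pmix_base_exchange_timeout 600 ...

or the equivalent OMPI_MCA_* environment variables; the "600" values above are just 
placeholders, not the actual values I tried.)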

Once again, if I remove "--map-by node" it runs successfully.
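
In other words, schematically the only difference between the failing and the working 
run is that one option, i.e. roughly:

mpirun --map-by node ./test.exe    (hangs and then aborts as above)
mpirun ./test.exe                  (runs to completion)

(These lines are schematic only; the actual invocation is in the attached test1.pbs.)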

-Andrew



On Sunday, September 16, 2018 7:03:15 AM PDT Ralph H Castain wrote:
> I see you are using “preconnect_all” - that is the source of the trouble. I
> don’t believe we have tested that option in years and the code is almost
> certainly dead. I’d suggest removing that option and things should work.
> > On Sep 15, 2018, at 1:46 PM, Andrew Benson <abenso...@gmail.com> wrote:
> > 
> > I'm running into problems trying to spawn MPI processes across multiple
> > nodes on a cluster using recent versions of OpenMPI. Specifically, using
> > the attached Fortran code, compiled using OpenMPI 3.1.2 with:
> > 
> > mpif90 test.F90 -o test.exe
> > 
> > and run via a PBS scheduler using the attached test1.pbs, it fails as can
> > be seen in the attached testFAIL.err file.
> > 
> > If I do the same but using OpenMPI v1.10.3 then it works successfully,
> > giving me the output in the attached testSUCCESS.err file.
> > 
> > From testing a few different versions of OpenMPI it seems that the behavior
> > changed between v1.10.7 and v2.0.4.
> > 
> > Is there some change in options needed to make this work with newer
> > OpenMPIs?
> > 
> > Output from ompi_info --all is attached. config.log can be found here:
> > 
> > http://users.obs.carnegiescience.edu/abenson/config.log.bz2
> > 
> > Thanks for any help you can offer!
> > 
> > -Andrew
> > 
> > <ompi_info.log.bz2><test.F90><test1.pbs><testFAIL.err.bz2><testSUCCESS.err.bz2>
> 


-- 

* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html

* Galacticus: https://bitbucket.org/abensonca/galacticus
