> On Sep 11, 2015, at 10:00 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Ralph,
> 
> at first glance, these errors look unrelated to PMIx.
> I noticed a bunch of bind() failures.
> based on your command line, I guess you are not running your job via a batch 
> manager,
> and I would guess not all unix sockets are always cleaned up.

Yeah, the no-disconnect test was causing mpirun to segfault, which meant that 
the sockets weren’t cleaned up. So eventually I’d hit a case where they 
collided. Simply blowing away the session directory tree resolves the problem.
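
As a rough illustration of that failure mode, here is a minimal Python sketch. It is not Open MPI code: the "fake-session-" directory and the "rendezvous.sock" name are made up for the example, and the real session directory layout is not assumed. It shows a Unix socket file surviving a process that never unlinked it, a later bind() on the same path failing, and removal of the directory tree clearing the collision.

```python
import errno
import os
import shutil
import socket
import tempfile


def bind_unix(path):
    """Bind a Unix-domain stream socket at `path` and return it."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    return sock


def demo():
    # Hypothetical stand-in for a session directory (not the real layout).
    session_dir = tempfile.mkdtemp(prefix="fake-session-")
    path = os.path.join(session_dir, "rendezvous.sock")

    # First "run": bind, then close without unlinking, as a crash would.
    bind_unix(path).close()
    stale_left_behind = os.path.exists(path)

    # Second "run": the stale socket file makes bind() fail.
    try:
        bind_unix(path).close()
        second_errno = None
    except OSError as exc:
        second_errno = exc.errno

    # Blow away the whole directory tree, recreate it, and retry.
    shutil.rmtree(session_dir)
    os.makedirs(session_dir)
    bind_unix(path).close()
    rebound = True

    shutil.rmtree(session_dir)  # tidy up after the demo
    return stale_left_behind, second_errno, rebound


if __name__ == "__main__":
    print(demo())
```

Closing a Unix-domain socket does not remove its filesystem entry, which is why a process that dies before its cleanup code runs leaves the path occupied for the next one.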

> (or this is an old bug and you did not manually clean your nodes when it was 
> fixed)
> 
> the neighbor_allgather_self failure is discussed at 
> https://github.com/open-mpi/ompi/pull/790
Ah, indeed - thanks!

> 
> I will have a look at the op-related failure on Monday
> (looks like an MPI conformance issue unrelated to PMIx)

Again, thanks!

> 
> Cheers,
> 
> Gilles
> 
> On Saturday, September 12, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> Hi folks
> 
> I’ve closed all the holes I can find in the PMIx integration, and things look 
> pretty good overall. There are a handful of failures still being seen - most 
> of them involving what appear to be unrelated code. I’m not entirely sure I 
> understand the source of the errors, and could really use some help to 
> determine (a) if these are in any way related to PMIx, and if so (b) how.
> 
> The errors from my MTT run are here:
> http://mtt.open-mpi.org/index.php?do_redir=2256
> 
> Any help diagnosing these problems would be greatly appreciated.
> Ralph
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/18015.php
