Hmmm... yeah, I know we saw this and resolved it in the trunk, but it looks like the fix never made it over to the 1.8 branch. I’ll take a gander (pretty sure I remember how I fixed it). Thanks!
> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
> Ralph,
>
> I noted several hangs in MTT with the v1.8 branch.
>
> A simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
> from the intel_tests suite: invoke mpirun on one node and run the tasks
> on another node:
>
> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>
> /* since this is a race condition, you might need to run this in a loop
> in order to hit the bug */
>
> The attached tarball contains a patch (add debug + temporary hack) and
> some log files obtained with
> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>
> Without the hack, I can reproduce the bug with -np 3 (log.ko.txt). With
> the hack, I can still reproduce the hang (though it might
> be a different one) with -np 16 (log.ko.2.txt).
>
> I remember some similar hangs were fixed on the trunk/master a few
> months ago.
> I tried to backport some commits but it did not help :-(
>
> Could you please have a look at this?
>
> Cheers,
>
> Gilles
> <abort_hang.tar.gz>
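
The quoted report says the hang is a race condition and may need a loop to reproduce. A rough sketch of such a loop follows, for anyone who wants to try it. It assumes GNU coreutils `timeout` is available. The helper name `run_until_hang`, the run count, and the time limit are all made up for illustration and are not part of Open MPI or MTT. A nonzero exit from the test is treated as normal, on the assumption that the errhandler test is expected to abort; only a timeout counts as a hang.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or MTT): rerun a command until
# one run fails to exit within a time limit, which we treat as a hang.
# usage: run_until_hang MAX_RUNS LIMIT_SECONDS command [args...]
run_until_hang() {
    max_runs=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # GNU timeout exits with status 124 when the limit is reached.
        timeout "$limit" "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $i: no exit after ${limit}s, likely hang"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $max_runs runs"
    return 0
}

# Example invocation (100 runs and a 60 s limit are assumed values):
#   run_until_hang 100 60 mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
```

Killing mpirun on timeout may leave stray processes on the remote node, so it is probably worth checking node1 before the next attempt.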