Unfortunately, this issue appears to have been introduced by a change that was supposed to make the ORTE code more debugging-friendly [obviously]. That change duplicated the epoch-tainted error managers into "default" versions that are more stable but provide fewer features.
The patches proposed so far as a solution to this problem (r25245, r25248) should be removed: they do not really solve the problem, they only alleviate the symptoms. From here there are two possible fixes:

1. Put back the code dealing with daemons leaving the job in the "default" version of the orted error manager. Here are the lines to be added in update_status in orte/mca/errmgr/default_orted/errmgr_default_orted.c (a sketch of where they sit follows below):

       if (0 == orte_routed.num_routes() &&
           0 == opal_list_get_size(&orte_local_children)) {
           orte_quit();
       }

2. Remove the "default" versions and fall back to a single sane set of error managers.

  george.

PS: I feel compelled to clarify a statement here. The failure detection at the socket level is working properly under all circumstances; it is how we deal with it at the upper levels that has suffered some mishandling lately.
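For those without the tree in front of them, a minimal sketch of where that check belongs. Only the three-line condition itself comes from the proposal above; the function name, its (empty) argument list, and the surrounding comments are placeholders for illustration, and the ORTE-internal declarations (orte_routed, orte_local_children, orte_quit) are assumed to come from headers the file already pulls in.

    /* Sketch only: shows the proposed check in the default_orted error
     * manager's state-update path.  The real update_status callback has
     * the argument list defined by the errmgr framework; it is omitted
     * here to keep the sketch short. */
    static void update_status_sketch(void)
    {
        /* ... existing handling of the reported state change (routing
         *     tree update, bookkeeping for the departed daemon, ...) ... */

        /* A daemon that routes no other daemons and has no local
         * application children left has nothing more to do, so let it
         * exit cleanly instead of keeping mpirun waiting. */
        if (0 == orte_routed.num_routes() &&
            0 == opal_list_get_size(&orte_local_children)) {
            orte_quit();
        }
    }

Option 1 amounts to putting this check back where the duplication dropped it.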
On Oct 11, 2011, at 13:17 , George Bosilca wrote:

> On Oct 11, 2011, at 01:17 , Ralph Castain wrote:
>
>> On Oct 10, 2011, at 11:29 AM, Ralph Castain wrote:
>>
>>> On Oct 10, 2011, at 11:14 AM, George Bosilca wrote:
>>>
>>>> Ralph,
>>>>
>>>> If you don't mind I would like to understand this issue a little bit more. What exactly is broken in the termination detection?
>>>>
>>>> From a network point of view, there is a slight issue with the commit 25245. A direct call to exit will close all pending sockets, with a linger of 60 seconds (quite bad if you use static ports as an example). There are proper protocols to shutdown sockets in a reliable way, maybe it is time to implement one of them.
>>
>> I should have clarified something here. The change in r25245 doesn't cause the daemons to directly call exit. It causes them to call orte_quit, which runs thru orte_finalize on its way out. So if the oob properly shuts down its sockets, then there will be no issue with lingering sockets.
>>
>> The question has always been: does the oob properly shut down sockets? If someone has time to address that issue, it would indeed be helpful.
>
> Please, allow me to clarify this one. We lived with the OOB issue for 7 (oh a lucky number) years, and we always manage to patch the rest of the ORTE to deal with.
>
> Why suddenly someone would bother to fix it?
>
>  george.
>
>>> Did I miss this message somewhere? I didn't receive it.
>>>
>>> The problem is that the daemons with children are not seeing the sockets close when their child daemons terminate. So the lowest layer of daemons terminate (as they have no children). The next layer does not terminate, leaving mpirun hanging.
>>>
>>> What's interesting is that mpirun does see the sockets directly connected to it close. As you probably recall, the binomial routed module has daemons directly callback to mpirun, so that connection is created. The socket to the next level of daemon gets created later - it is this later socket whose closure isn't being detected.
>>>
>>> From what I could see, it appeared to be a progress issue. The socket may be closed, but the event lib may not be progressing to detect it. As we intend to fix that anyway with the async progress work, and static ports are not the default behavior (and rarely used), I didn't feel it worth leaving the trunk hanging while continuing to chase it.
>>>
>>> If someone has time, properly closing the sockets is something we've talked about for quite awhile, but never gotten around to doing. I'm not sure it will resolve this problem, but would be worth doing anyway.
>>>
>>>> Thanks,
>>>>  george.
>>>>
>>>> On Oct 10, 2011, at 12:40 , Ralph Castain wrote:
>>>>
>>>>> It wasn't the launcher that was broken, but termination detection, and not for all environments (e.g., worked fine for slurm). It is a progress-related issue.
>>>>>
>>>>> Should be fixed in r25245.
>>>>>
>>>>> On Oct 10, 2011, at 8:33 AM, Shamis, Pavel wrote:
>>>>>
>>>>>> + 1 , I see the same issue.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Yevgeny Kliteynik
>>>>>>> Sent: Monday, October 10, 2011 10:24 AM
>>>>>>> To: OpenMPI Devel
>>>>>>> Subject: [OMPI devel] Launcher in trunk is broken?
>>>>>>>
>>>>>>> It looks like the process launcher is broken in the OMPI trunk: If you run any simple test (not necessarily including MPI calls) on 4 or more nodes, the MPI processes won't be killed after the test finishes.
>>>>>>>
>>>>>>> $ mpirun -host host_1,host_2,host_3,host_4 -np 4 --mca btl sm,tcp,self /bin/hostname
>>>>>>>
>>>>>>> Output:
>>>>>>> host_1
>>>>>>> host_2
>>>>>>> host_3
>>>>>>> host_4
>>>>>>>
>>>>>>> And test is hanging......
>>>>>>>
>>>>>>> I have an older trunk (r25228), and everything is OK there. Not sure if it means that something was broken after that, or the problem existed before, but kicked in only now due to some other change.
>>>>>>>
>>>>>>> -- YK
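As for the "proper protocols to shutdown sockets in a reliable way" mentioned in the quoted discussion, the usual pattern is an orderly release: stop sending with shutdown(), drain whatever the peer still has in flight until EOF, then close(). A minimal POSIX sketch of that pattern follows; it is a generic illustration, not the OOB code, and the helper name is made up.

    /* Generic orderly TCP shutdown (not ORTE/OOB code): announce
     * end-of-output, drain the peer's remaining data, then close.
     * Returns 0 on success, -1 on error. */
    #include <sys/socket.h>
    #include <unistd.h>

    static int graceful_close(int fd)
    {
        char buf[4096];
        ssize_t n;

        /* Send our FIN but keep the receive side open. */
        if (shutdown(fd, SHUT_WR) != 0) {
            close(fd);
            return -1;
        }

        /* Drain until the peer closes its side (read() returns 0). */
        do {
            n = read(fd, buf, sizeof(buf));
        } while (n > 0);

        return close(fd);
    }

Closing only after the drain avoids tearing the connection down while unread data is still queued, which is one of the ways sockets end up resetting or lingering instead of closing cleanly.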