On Oct 11, 2011, at 1:40 PM, George Bosilca wrote:

> Unfortunately, this issue appears to have been introduced by a change that
> was supposed to make the ORTE code more debugging-friendly [obviously].
> That particular change duplicated the epoch-tainted error managers into
> their "default" versions, which are more stable but provide fewer features.
Interesting - I didn't see any "comm failure" calls back to the errmgr.
However, it is always possible I missed them... I'll check again.

> The patches (25245, 25248) proposed so far as a solution to this problem
> should be removed, as they do not really solve the problem; they only
> alleviate the symptoms. From here there are two possible fixes:
>
> 1. Put back the code dealing with the daemons leaving the job in the
> "default" version of the orted error manager.
>
> Here are the lines to be added in update_status in
> orte/mca/errmgr/default_orted/errmgr_default_orted.c:
>
>     if (0 == orte_routed.num_routes() &&
>         0 == opal_list_get_size(&orte_local_children)) {
>         orte_quit();
>     }

Thanks for looking at this more closely. I'll restore those lines and see if
we are actually getting there. It could be that the system I'm using behaves
differently.

> 2. Remove the "default" versions and fall back to a single sane set of
> error managers.

No thanks, I'd rather only go thru this recovery process once :-)

> george.
>
> PS: I feel compelled to clarify a statement here. The failure detection at
> the socket level is working properly under all circumstances; how we deal
> with it at the upper levels has suffered some mishandling lately.
>
>
> On Oct 11, 2011, at 13:17 , George Bosilca wrote:
>
>>
>> On Oct 11, 2011, at 01:17 , Ralph Castain wrote:
>>
>>>
>>> On Oct 10, 2011, at 11:29 AM, Ralph Castain wrote:
>>>
>>>>
>>>> On Oct 10, 2011, at 11:14 AM, George Bosilca wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> If you don't mind, I would like to understand this issue a little
>>>>> bit more. What exactly is broken in the termination detection?
>>>>>
>>>>>> From a network point of view, there is a slight issue with commit
>>>>>> 25245. A direct call to exit will close all pending sockets with a
>>>>>> linger of 60 seconds (quite bad if you use static ports, for
>>>>>> example). There are proper protocols for shutting down sockets
>>>>>> reliably; maybe it is time to implement one of them.
>>>
>>> I should have clarified something here. The change in r25245 doesn't
>>> cause the daemons to directly call exit. It causes them to call
>>> orte_quit, which runs thru orte_finalize on its way out. So if the oob
>>> properly shuts down its sockets, then there will be no issue with
>>> lingering sockets.
>>>
>>> The question has always been: does the oob properly shut down sockets?
>>> If someone has time to address that issue, it would indeed be helpful.
>>
>> Please, allow me to clarify this one. We lived with the OOB issue for 7
>> (oh, a lucky number) years, and we have always managed to patch the rest
>> of ORTE to deal with it.
>>
>> Why would someone suddenly bother to fix it?
>>
>> george.
>>
>>>
>>>
>>>>
>>>> Did I miss this message somewhere? I didn't receive it.
>>>>
>>>> The problem is that the daemons with children are not seeing the
>>>> sockets close when their child daemons terminate. So the lowest layer
>>>> of daemons terminates (as they have no children), but the next layer
>>>> does not terminate, leaving mpirun hanging.
>>>>
>>>> What's interesting is that mpirun does see the sockets directly
>>>> connected to it close. As you probably recall, the binomial routed
>>>> module has daemons call back directly to mpirun, so that connection
>>>> is created. The socket to the next level of daemon gets created
>>>> later - it is this later socket whose closure isn't being detected.
>>>>
>>>> From what I could see, it appeared to be a progress issue.
>>>> The socket may be closed, but the event lib may not be progressing to
>>>> detect it. As we intend to fix that anyway with the async progress
>>>> work, and static ports are not the default behavior (and are rarely
>>>> used), I didn't feel it was worth leaving the trunk hanging while
>>>> continuing to chase it.
>>>>
>>>> If someone has time, properly closing the sockets is something we've
>>>> talked about for quite a while but never gotten around to doing. I'm
>>>> not sure it will resolve this problem, but it would be worth doing
>>>> anyway.
>>>>
>>>>>
>>>>> Thanks,
>>>>> george.
>>>>>
>>>>> On Oct 10, 2011, at 12:40 , Ralph Castain wrote:
>>>>>
>>>>>> It wasn't the launcher that was broken, but termination detection,
>>>>>> and not for all environments (e.g., it worked fine for slurm). It
>>>>>> is a progress-related issue.
>>>>>>
>>>>>> Should be fixed in r25245.
>>>>>>
>>>>>>
>>>>>> On Oct 10, 2011, at 8:33 AM, Shamis, Pavel wrote:
>>>>>>
>>>>>>> +1, I see the same issue.
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>>>>>>>> On Behalf Of Yevgeny Kliteynik
>>>>>>>> Sent: Monday, October 10, 2011 10:24 AM
>>>>>>>> To: OpenMPI Devel
>>>>>>>> Subject: [OMPI devel] Launcher in trunk is broken?
>>>>>>>>
>>>>>>>> It looks like the process launcher is broken in the OMPI trunk:
>>>>>>>> if you run any simple test (not necessarily including MPI calls)
>>>>>>>> on 4 or more nodes, the MPI processes won't be killed after the
>>>>>>>> test finishes.
>>>>>>>>
>>>>>>>> $ mpirun -host host_1,host_2,host_3,host_4 -np 4 --mca btl sm,tcp,self /bin/hostname
>>>>>>>>
>>>>>>>> Output:
>>>>>>>> host_1
>>>>>>>> host_2
>>>>>>>> host_3
>>>>>>>> host_4
>>>>>>>>
>>>>>>>> And the test is hanging...
>>>>>>>>
>>>>>>>> I have an older trunk (r25228), and everything is OK there. Not
>>>>>>>> sure whether that means something was broken after that, or the
>>>>>>>> problem existed before but only kicked in now due to some other
>>>>>>>> change.
>>>>>>>>
>>>>>>>> -- YK
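
For reference, the "proper protocols" for shutting down sockets that George
mentions above amount to the usual orderly TCP release: half-close the
connection, drain whatever the peer still has in flight, and only then close
the descriptor. The following is a minimal sketch of that sequence, not the
OOB TCP component's actual teardown code; it assumes a connected, blocking
socket whose outgoing data has already been flushed.

    #include <sys/socket.h>
    #include <unistd.h>
    #include <errno.h>

    /* Illustrative orderly shutdown of one connected TCP socket "fd".
     * Not the actual OOB code - just the standard half-close sequence. */
    static int orderly_close(int fd)
    {
        char buf[4096];
        ssize_t n;

        /* Half-close: send a FIN so the peer knows we are done writing. */
        if (shutdown(fd, SHUT_WR) < 0 && errno != ENOTCONN) {
            close(fd);
            return -1;
        }

        /* Drain anything still in flight until the peer closes its side
         * (recv returns 0).  Closing with unread data queued would make
         * the kernel send an RST instead of a clean teardown. */
        do {
            n = recv(fd, buf, sizeof(buf), 0);
        } while (n > 0 || (n < 0 && errno == EINTR));

        return close(fd);
    }

Handled this way, the peer sees a clean EOF rather than an abortive close,
which is the concern behind the 60-second linger and static-port remarks
above.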
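
On the "event lib may not be progressing" point: with libevent, a peer's
close is only ever observed as a zero-byte read delivered by a read callback,
so the closure goes unnoticed unless the event loop is actually being driven.
Below is a self-contained sketch of that mechanism using the public libevent
2 API, not ORTE's actual oob/tcp code; the names watch_peer and peer_read_cb
are hypothetical.

    #include <event2/event.h>
    #include <sys/socket.h>
    #include <stdio.h>

    /* Fires whenever "fd" becomes readable; a 0-byte recv is the only
     * signal that the remote daemon closed its end of the connection. */
    static void peer_read_cb(evutil_socket_t fd, short what, void *arg)
    {
        struct event_base *base = arg;
        char buf[512];
        ssize_t n;

        (void)what;   /* not needed here */
        n = recv(fd, buf, sizeof(buf), 0);
        if (n == 0) {
            /* Peer closed: this is where a real errmgr would update its
             * route/child bookkeeping and decide whether to quit. */
            printf("peer on fd %d closed\n", (int)fd);
            event_base_loopbreak(base);
        }
    }

    /* Register a persistent read watcher.  The closure above is only
     * noticed while event_base_dispatch()/loop() is actually running. */
    int watch_peer(struct event_base *base, evutil_socket_t fd)
    {
        struct event *ev = event_new(base, fd, EV_READ | EV_PERSIST,
                                     peer_read_cb, base);
        if (ev == NULL) {
            return -1;
        }
        return event_add(ev, NULL);
    }

This is consistent with Ralph's observation above: the sockets mpirun itself
watches are being progressed and their closure is seen, while the closure of
the later daemon-to-daemon socket apparently is not.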