Re: [OMPI devel] opal_event_loop exiting

Greg Watson Thu, 20 Apr 2006 13:06:31 -0400

The simplest thing for us would be for opal_event_loop() to return an error value. That way we can detect the situation and clean up our system. At the moment we're not trying to restart orted, so clean recovery of orte is not that important, though ultimately I would think it is desirable. Other alternatives are to pass you an error handler that you call, or you could send a signal that we can trap.
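
To make the first two options concrete, here is a minimal sketch in C of an event loop that reports a lost peer through its return value and through an optional error handler, instead of calling exit(). Every name in it (loop_t, loop_run, loop_error_cb, LOOP_ERR_PEER_LOST, record_failure) is hypothetical and chosen for illustration; none of it is the actual opal_event_loop() or errmgr interface.

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not the Open MPI API. */
typedef enum {
    LOOP_OK = 0,
    LOOP_ERR_PEER_LOST = -1
} loop_status_t;

/* Optional callback supplied by the host application. */
typedef void (*loop_error_cb)(loop_status_t status, void *ctx);

typedef struct {
    int peer_alive;          /* stand-in for "the daemon connection is up" */
    int events_pending;      /* stand-in for the queue of events to dispatch */
    loop_error_cb on_error;  /* may be NULL */
    void *cb_ctx;            /* passed back to on_error unchanged */
} loop_t;

/* Dispatch pending events. If the peer is gone, notify the handler (if any)
 * and return an error code; the library never calls exit() itself. */
loop_status_t loop_run(loop_t *loop)
{
    while (loop->events_pending > 0) {
        if (!loop->peer_alive) {
            if (loop->on_error != NULL) {
                loop->on_error(LOOP_ERR_PEER_LOST, loop->cb_ctx);
            }
            return LOOP_ERR_PEER_LOST;
        }
        loop->events_pending--;  /* "handle" one event */
    }
    return LOOP_OK;
}

/* Example handler: record the failure so the host can clean up later. */
void record_failure(loop_status_t status, void *ctx)
{
    *(loop_status_t *)ctx = status;
}
```

With something along these lines, the host would check the return value of loop_run() after each call and tear down its own state when it sees LOOP_ERR_PEER_LOST, while the process itself keeps running.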

From our perspective, we're simply calling a library that does stuff. Having the library call exit() at any point is a major problem for applications trying to do more than run a single job.


Greg

On Apr 20, 2006, at 9:40 AM, Ralph Castain wrote:

Well, I actually don't know much about opal_event_loop and/or how it is intended to work. My guess is that:
(a) your remote orted is acting as the seed and your local process (the one in Eclipse) is running as a client to that seed - at least, that was the case last I talked to Nathan
(b) when the seed orted dies, it is the oob in your local client that actually detects socket closure and decides that - since it is the seed that has lost contact - the local application must abort.
(c) the errmgr.abort function does exactly what it was supposed to do - it provides an immediate way of killing the local process.
I'd be a little hesitant to recommend overloading the errmgr.abort function as you really do want the local processes to die when losing connection to the seed (at least, until we develop a recovery capability for the seed orted - which is some ways off), and (given the way you are running) I'm not sure you can have a different errmgr for your process while leaving the other one for everyone else.
Probably the best solution for now would be for us to insert (yet another) MCA parameter into the errmgr that would (if set) have errmgr.abort do something other than exit. The question then is: what would you want it to do? We need to have it tell the rest of the system to stop trying to send messages etc - right now, I don't think the infrastructure exists to do that short of killing orte.
We could try to have errmgr.abort do an orte_finalize - that would kill the orte system without impacting your host program, I suspect. You would then have to re-initialize, so we'd have to find some way to let you know that we had finalized. I can't swear this will work, though - we might well generate a segfault since this is happening deep down inside the system. We could try it, though.
Would any of that be of help? Do you have any suggestions on how we might let you know that we had finalized?
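
As a rough illustration of the parameter-controlled abort described above, the following C sketch switches between exiting and finalizing, and leaves a flag the host can poll to learn that it must re-initialize. All names (runtime_t, abort_policy_t, runtime_abort, runtime_needs_reinit) are hypothetical stand-ins, not the real errmgr or orte_finalize interfaces, and the policy field only mimics what a runtime parameter might select.

```c
#include <stdlib.h>

/* Hypothetical names throughout; this is not the Open MPI API. */
typedef enum {
    ABORT_EXIT,      /* default behaviour: kill the local process */
    ABORT_FINALIZE   /* shut the runtime down but keep the host alive */
} abort_policy_t;

typedef struct {
    abort_policy_t policy;   /* would be chosen by a runtime parameter */
    int initialized;         /* 1 while the runtime is usable */
    int finalized_by_abort;  /* set when abort tore the runtime down */
} runtime_t;

void runtime_init(runtime_t *rt, abort_policy_t policy)
{
    rt->policy = policy;
    rt->initialized = 1;
    rt->finalized_by_abort = 0;
}

void runtime_finalize(runtime_t *rt)
{
    rt->initialized = 0;
}

/* Called when contact with the seed is lost. Under ABORT_EXIT this does
 * not return; under ABORT_FINALIZE it finalizes, records why, and hands
 * the error code back to the caller. */
int runtime_abort(runtime_t *rt, int code)
{
    if (rt->policy == ABORT_EXIT) {
        exit(code);
    }
    runtime_finalize(rt);
    rt->finalized_by_abort = 1;
    return code;
}

/* The host polls this to learn that it has to call runtime_init() again. */
int runtime_needs_reinit(const runtime_t *rt)
{
    return rt->finalized_by_abort;
}
```

A polled flag is only one possible notification channel; a callback or a distinct return code from the event loop would serve the same purpose. The sketch also sidesteps the hard part raised above, namely whether finalizing from deep inside the system is safe at all.
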
Ralph


Brian Barrett wrote:
On Apr 19, 2006, at 4:15 PM, Greg Watson wrote:
We've just run across a rather tricky issue. We're calling opal_event_loop() to dispatch orte events to an orted that has been launched separately. However, if the orted dies for some reason (gets a signal or whatever) then opal_event_loop() is calling exit(). Needless to say, this is not good behavior for us. Any suggestions on how to get around this problem?
Is the orted you are connecting to the "seed" daemon? I think the only time we should be exiting like that is if the orted was the seed daemon. I'm not sure what we want to do if that's the case -- it looks like we're calling errmgr.abort() when badness happens. I wonder if your application can provide its own errmgr component that provides an abort that doesn't actually abort? Just some off the cuff ideas -- Ralph could probably give a better idea of exactly what is happening...
Brian