Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland
Committed in r24815. On Thursday, June 23, 2011 at 4:19 PM, Ralph Castain wrote: > > On Jun 23, 2011, at 2:14 PM, Wesley Bland wrote: > > Maybe before the ORTED saw the signal, it detected a communication failure > > and reacted to that. > > Quite possible. However, remember that procs local

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Josh Hursey
Ga - what a rookie mistake :) I tested the patched test and it works as advertised for the small scale tests I used before. So I'm good with this going in today. Thanks, Josh On Thu, Jun 23, 2011 at 3:34 PM, Wesley Bland wrote: > Right. Sorry I misspoke. > > On Thursday,

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland
Right. Sorry I misspoke. On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote: > Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem > of "not giving up the thread". The problem was that Josh's test never called > progress. It would have been equally okay to

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Ralph Castain
Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem of "not giving up the thread". The problem was that Josh's test never called progress. It would have been equally okay to simply call "opal_event_dispatch" while waiting for the callback. All applications have to
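The point about calling progress can be illustrated with a toy single-threaded event queue. This is only a sketch under stated assumptions: `toy_post_event`, `toy_progress`, `toy_wait_for` and `toy_on_failure` are hypothetical names, not ORTE or OPAL functions. It shows only that a queued callback never fires unless the waiting code drives the loop itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-threaded event queue; NOT the ORTE/OPAL API. */
typedef void (*toy_cb_t)(void *arg);

#define TOY_MAX_EVENTS 8

static struct { toy_cb_t cb; void *arg; } toy_queue[TOY_MAX_EVENTS];
static size_t toy_count = 0;

/* Queue a callback; nothing runs until someone drives progress. */
static void toy_post_event(toy_cb_t cb, void *arg)
{
    assert(toy_count < TOY_MAX_EVENTS);
    toy_queue[toy_count].cb = cb;
    toy_queue[toy_count].arg = arg;
    toy_count++;
}

/* Deliver every pending callback once; returns how many were delivered.
 * In this toy, callbacks must not post new events while being delivered. */
static size_t toy_progress(void)
{
    size_t delivered = toy_count;
    for (size_t i = 0; i < delivered; i++) {
        toy_queue[i].cb(toy_queue[i].arg);
    }
    toy_count = 0;
    return delivered;
}

/* Example callback: records that a failure notification arrived. */
static void toy_on_failure(void *arg)
{
    *(bool *)arg = true;
}

/* Wait for the flag by driving the loop rather than spinning blindly.
 * Gives up after max_spins progress calls and returns the flag's value. */
static bool toy_wait_for(const bool *flag, int max_spins)
{
    for (int i = 0; i < max_spins && !*flag; i++) {
        toy_progress();
    }
    return *flag;
}
```

With no threads, a wait loop that only tests the flag would never see it change; the flag is set only from inside `toy_progress`.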

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland
Josh, There were a couple of bugs that I cleared up in my most recent checkin, but I also needed to modify your test. The callback for the application layer errmgr actually occurs in the application layer. Your test was never giving up the thread to the ORTE application event loop to receive

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Josh Hursey
So I finally got a chance to test the branch this morning. I cannot get it to work. Maybe I'm doing something wrong, or missing some MCA parameter? - [jjhursey@smoky-login1 resilient-orte] hg summary parent: 2:c550cf6ed6a2 tip Newest version. Synced with trunk r24785. branch:

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland
Last reminder (I hope). RFC goes in at COB today. Wesley

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Josh Hursey
Sounds good. Thanks. On Sat, Jun 18, 2011 at 9:31 AM, Wesley Bland wrote: > That's fine. Let's say Thursday COB is now the timeout. > > On Jun 18, 2011 9:10 AM, "Joshua Hursey" wrote: >> Cool. Then can we hold off pushing this into the trunk for a

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Wesley Bland
That's fine. Let's say Thursday COB is now the timeout. On Jun 18, 2011 9:10 AM, "Joshua Hursey" wrote: > Cool. Then can we hold off pushing this into the trunk for a couple days until I get a chance to test it? Monday COB does not give me much time since we just got the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Joshua Hursey
Cool. Then can we hold off pushing this into the trunk for a couple days until I get a chance to test it? Monday COB does not give me much time since we just got the new patch on Friday COB (the RFC gave us 2 weeks to review the original patch). Would waiting until next Thursday/Friday COB be

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-17 Thread Wesley Bland
I believe that it does. I made quite a few changes in the last checkin though I didn't run your specific test this afternoon. I'll be able to try it later this evening but it should be easy to test now that it's synced with the trunk again. On Jun 17, 2011 5:32 PM, "Josh Hursey"

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-17 Thread Josh Hursey
Does this include a fix for the problem I reported with mpirun-hosted processes? If not I would ask that we hold off on putting it into the trunk until that particular bug is addressed. From my experience tackling this particular issue requires some code refactoring, which should probably be

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
No issue - just trying to get ahead of the game instead of running into an issue later. We can leave it for now. On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote: > We could, but we could also just replace the callback. I will never > want to use it in my scenario, and if I did then I could just

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
We could, but we could also just replace the callback. I will never want to use it in my scenario, and if I did then I could just call it directly instead of relying on the errmgr to do the right thing. So why complicate the errmgr with additional complexity for something that we don't need at the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
So why not have the callback return an int, and your callback returns "go no further"? On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote: > Yeah I do not want the default fatal callback in OMPI. I want to > replace it with something that allows OMPI to continue running when > there are process
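The idea of a callback returning an int that means "go no further" could look roughly like the sketch below. Everything in it is an assumption made for illustration: the return codes `CB_CONTINUE` and `CB_STOP`, the `chain_register` and `chain_dispatch` helpers, and the newest-first call order are hypothetical and not taken from the ORTE errmgr.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; NOT taken from the ORTE errmgr. */
enum { CB_CONTINUE = 0, CB_STOP = 1 };

typedef int (*chain_cb_t)(int failed_rank);

#define CHAIN_MAX 4
static chain_cb_t chain_cbs[CHAIN_MAX];
static size_t chain_len = 0;

/* Append a callback to the chain. */
static void chain_register(chain_cb_t cb)
{
    assert(chain_len < CHAIN_MAX);
    chain_cbs[chain_len++] = cb;
}

/* Call callbacks newest-first (an assumed ordering) until one returns
 * CB_STOP. Returns the number of callbacks actually invoked. */
static size_t chain_dispatch(int failed_rank)
{
    size_t invoked = 0;
    for (size_t i = chain_len; i > 0; i--) {
        invoked++;
        if (chain_cbs[i - 1](failed_rank) == CB_STOP) {
            break;
        }
    }
    return invoked;
}

/* Stand-in for the default fatal handler; it only counts calls here
 * instead of aborting, so the sketch stays testable. */
static int default_fatal_calls = 0;
static int default_fatal_cb(int failed_rank)
{
    (void)failed_rank;
    default_fatal_calls++;
    return CB_CONTINUE;
}

/* A fault-tolerant layer's handler: handles the failure and stops the chain. */
static int tolerant_cb(int failed_rank)
{
    (void)failed_rank;
    return CB_STOP;
}
```

Registering `default_fatal_cb` first and `tolerant_cb` second means a dispatch invokes only the tolerant handler, so the fatal default is never reached.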

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
Yeah I do not want the default fatal callback in OMPI. I want to replace it with something that allows OMPI to continue running when there are process failures (if the error handlers associated with the communicators permit such an action). So having the default fatal callback called after mine

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: >> >> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: >> >>> >>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >>> Well, you're way too trusty. ;) >>> >>>

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote: >> >> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: >> >>> Another problem with this patch, that I mentioned to Wesley and George >>> off list, is that it does not

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote: > > On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: > >> Another problem with this patch, that I mentioned to Wesley and George >> off list, is that it does not handle the case when mpirun/HNP is also >> hosting processes

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: >> >> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: >> >>> >>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >>> Well, you're way too trusty. ;) >>> >>>

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: > Another problem with this patch, that I mentioned to Wesley and George > off list, is that it does not handle the case when mpirun/HNP is also > hosting processes that might fail. In my testing of the patch it > worked fine if mpirun/HNP was

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
Another problem with this patch, that I mentioned to Wesley and George off list, is that it does not handle the case when mpirun/HNP is also hosting processes that might fail. In my testing of the patch it worked fine if mpirun/HNP was -not- hosting any processes, but once it had to host processes

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
Okay, finally have time to sit down and review this. It looks pretty much identical to what was done in ORCM - we just kept "epoch" separate from the process name, and use multicast to notify all procs that someone failed. I do have a few questions/comments about your proposed patch: 1. I note

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: > > On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: > >> >> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >> >>> Well, you're way too trusty. ;) >> >> It's the midwestern boy in me :) > > Still need to shake that corn

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
Something else you might want to address in here: the current code sends an RML message from the proc calling abort to its local daemon telling the daemon that we are exiting due to the app calling "abort". We needed to do this because we wanted to flag the proc termination as one induced by

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: > > On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: > >> Well, you're way too trusty. ;) > > It's the midwestern boy in me :) Still need to shake that corn out of your head... :-) > >> >> This only works if all components play the game, and

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Joshua Hursey
On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: > Well, you're way too trusty. ;) It's the midwestern boy in me :) > > This only works if all components play the game, and even then it is > difficult if you want to allow components to deregister themselves in the > middle of the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c: - orte_errmgr.set_fault_callback(_errhandler_runtime_callback); - Which is a callback that just calls abort (which is what we want to do by default): - void

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Ralph Castain
You mean you want the abort API to point somewhere else, without using a new component? Perhaps a telecon would help resolve this quicker? I'm available tomorrow or anytime next week, if that helps. On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey wrote: > As long as there

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
As long as there is the ability to remove and replace a callback I'm fine. I personally think that forcing the errmgr to track ordering of callback registration makes it a more complex solution, but as long as it works. In particular I need to replace the default 'abort' errmgr call in OMPI with

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Ralph Castain
I agree - let's not get overly complex unless we can clearly articulate a requirement to do so. On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca wrote: > This will require exactly opposite registration and de-registration order, > or no de-registration at all (aka no way to

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread George Bosilca
This will require exactly opposite registration and de-registration order, or no de-registration at all (aka no way to unload a component). Or some even more complex code to deal with internally. If the error manager handles the callbacks it can use the registration ordering (which will be what

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
On Wed, Jun 8, 2011 at 5:37 PM, Wesley Bland wrote: > On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote: > > - orte_errmgr.post_startup() starts the persistent RML message. There > does not seem to be a shutdown version of this (to deregister the RML > message at

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-08 Thread George Bosilca
Well well well, that wasn't supposed to go on the mailing list ;) george On Jun 8, 2011, at 17:43 , George Bosilca wrote: > Hey if you want to push to the extreme the logic of the "computer scientist" > you were talking about in my office, then return the previous callback and > let the
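One possible reading of "return the previous callback" is a setter in the style of the C `signal()` function, which hands back whatever handler was installed before. The sketch below assumes that reading. `sketch_set_fault_cb`, `sketch_notify` and the two example callbacks are hypothetical names, not part of the patch under discussion.

```c
#include <stddef.h>

/* Hypothetical callback type; NOT the signature used by the patch. */
typedef void (*sketch_fault_cb_t)(int failed_rank);

static sketch_fault_cb_t sketch_current_cb = NULL;

/* Install cb and return whatever was installed before (NULL if none),
 * so the caller can restore it or chain to it later. */
static sketch_fault_cb_t sketch_set_fault_cb(sketch_fault_cb_t cb)
{
    sketch_fault_cb_t prev = sketch_current_cb;
    sketch_current_cb = cb;
    return prev;
}

/* Deliver a failure notification to the single installed callback. */
static void sketch_notify(int failed_rank)
{
    if (sketch_current_cb != NULL) {
        sketch_current_cb(failed_rank);
    }
}

/* Stand-in for a default "abort" handler; it only counts requests. */
static int sketch_abort_requests = 0;
static void sketch_default_cb(int failed_rank)
{
    (void)failed_rank;
    sketch_abort_requests++;
}

/* Stand-in for a handler that lets the job keep running. */
static int sketch_survived = 0;
static void sketch_tolerant_cb(int failed_rank)
{
    (void)failed_rank;
    sketch_survived++;
}
```

Under this design the error manager tracks a single slot and no registration order; any chaining is left to whoever holds the returned pointer.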

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-08 Thread Wesley Bland
On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote: - orte_errmgr.post_startup() starts the persistent RML message. There does not seem to be a shutdown version of this (to deregister the RML message at orte_finalize time). Was this intentional, or just missed? I just missed that one. I've

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks - that helps! On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland wrote: > Definitely we are targeting ORTED failures here. If an ORTED fails then > any other ORTEDs connected to it will notice and report the failure. Of > course if the failure is an application then the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey
I looked through the patch a bit more today and had a few notes/questions. - orte_errmgr.post_startup() starts the persistent RML message. There does not seem to be a shutdown version of this (to deregister the RML message at orte_finalize time). Was this intentional, or just missed? - in the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
Definitely we are targeting ORTED failures here. If an ORTED fails then any other ORTEDs connected to it will notice and report the failure. Of course if the failure is an application then the ORTED on that node will be the only one to detect it. Also, if an ORTED is lost, all of the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Quick question: could you please clarify this statement: ...because more than one ORTED could (and often will) detect the failure. > I don't understand how this can be true, except for detecting an ORTED failure. Only one orted can detect an MPI process failure, unless you have now involved

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Ah - thanks! That really helped clarify things. Much appreciated. Will look at the patch in this light... On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland wrote: > > Perhaps it would help if you folks could provide a little explanation about > how you use epoch? While the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
> > Perhaps it would help if you folks could provide a little explanation about > how you use epoch? While the value sounds similar, your explanations are > beginning to sound very different from what we are doing and/or had > envisioned. > > I'm not sure how you can talk about an epoch

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote: > > On Jun 7, 2011, at 12:14 , Ralph Castain wrote: > > > But the epoch is process-unique - i.e., it is the number of times that > this specific process has been started, which differs per proc since we > don't restart

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland wrote: > > On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote: > > > > On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > > To address your concerns about putting the epoch in the process name >

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread George Bosilca
On Jun 7, 2011, at 12:14 , Ralph Castain wrote: > But the epoch is process-unique - i.e., it is the number of times that this > specific process has been started, which differs per proc since we don't > restart all the procs every time one fails. Yes the epoch is per process, but it is

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote: > > > On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > > To address your concerns about putting the epoch in the process name > > structure, putting it in there rather than in

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
To address your concerns about putting the epoch in the process name structure, putting it in there rather than in a separately maintained list simplifies things later. For example, during communication you need to attach the epoch to each of your messages so they can be tracked later. If a
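The argument for keeping the epoch inside the name can be sketched as follows. This is a minimal illustration under stated assumptions: `sketch_name_t` and `sketch_msg_t` are hypothetical stand-ins, not the real `orte_process_name_t` or any RML message type, and the rule "drop messages from an older epoch" is an assumed policy rather than something the thread specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a process name that carries its epoch. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    uint32_t epoch;   /* number of times this process has died so far */
} sketch_name_t;

/* Hypothetical message: the sender's name, epoch included, travels with it. */
typedef struct {
    sketch_name_t sender;
    int payload;
} sketch_msg_t;

/* Same logical process, ignoring which incarnation it is. */
static bool sketch_same_proc(const sketch_name_t *a, const sketch_name_t *b)
{
    return a->jobid == b->jobid && a->vpid == b->vpid;
}

/* Assumed policy: accept a message only if it comes from the known process
 * and was not sent by an incarnation older than the one we know about. */
static bool sketch_accept(const sketch_name_t *known, const sketch_msg_t *msg)
{
    assert(known != NULL && msg != NULL);
    return sketch_same_proc(known, &msg->sender)
        && msg->sender.epoch >= known->epoch;
}
```

Because the epoch is part of the name, the receiver needs no separate lookup table to tell a stale message from a current one; the comparison uses only fields already attached to the message.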

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks for the explanation - as I said, I won't have time to really review the patch this week, but appreciate the info. I don't really expect to see a conflict as George had discussed this with me previously. I know I'll have merge conflicts with my state machine branch, which would be ready for

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
I'm on travel this week, but will look this over when I return. From the description, it sounds nearly identical to what we did in ORCM, so I expect there won't be many issues. You do get some race conditions that the new state machine code should help resolve. Only difference I can quickly see

[OMPI devel] RFC: Resilient ORTE

2011-06-06 Thread George Bosilca
WHAT: Allow the runtime to handle fail-stop failures for both runtime (daemons) and application level processes. This patch extends the orte_process_name_t structure with a field to store the process epoch (the number of times it died so far), and adds an application failure notification callback