subject:"[OMPI devel] RFC: Resilient ORTE"

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland

Committed in r24815. On Thursday, June 23, 2011 at 4:19 PM, Ralph Castain wrote: > > On Jun 23, 2011, at 2:14 PM, Wesley Bland wrote: > > Maybe before the ORTED saw the signal, it detected a communication failure > > and reacted to that. > > Quite possible. However, remember that procs local

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Josh Hursey

Ga - what a rookie mistake :) I tested the patched test and it works as advertised for the small scale tests I used before. So I'm good with this going in today. Thanks, Josh On Thu, Jun 23, 2011 at 3:34 PM, Wesley Bland wrote: > Right. Sorry I misspoke. > > On Thursday, June 23, 2011 at 3:32 P

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland

Right. Sorry I misspoke. On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote: > Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem > of "not giving up the thread". The problem was that Josh's test never called > progress. It would have been equally okay to simpl

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Ralph Castain

Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem of "not giving up the thread". The problem was that Josh's test never called progress. It would have been equally okay to simply call "opal_event_dispatch" while waiting for the callback. All applications have to cycle
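The point about progress can be illustrated with a toy, single-threaded event queue. This is only a sketch under stated assumptions: every `sketch_*` name below is hypothetical and is not ORTE or OPAL API. It shows only that a callback queued by the runtime is never delivered unless the waiting code keeps cycling the event loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy event queue: a hypothetical stand-in for the real event loop. */
typedef void (*sketch_event_fn)(void *arg);

typedef struct {
    sketch_event_fn fn;
    void *arg;
} sketch_event_t;

#define SKETCH_MAX_EVENTS 8
static sketch_event_t sketch_queue[SKETCH_MAX_EVENTS];
static size_t sketch_head = 0, sketch_tail = 0;

/* Queue an event; returns false when the toy queue is full. */
static bool sketch_post(sketch_event_fn fn, void *arg)
{
    if (sketch_tail == SKETCH_MAX_EVENTS) {
        return false;
    }
    sketch_queue[sketch_tail].fn = fn;
    sketch_queue[sketch_tail].arg = arg;
    sketch_tail++;
    return true;
}

/* Run at most one pending event; returns false when nothing is pending. */
static bool sketch_progress(void)
{
    if (sketch_head == sketch_tail) {
        return false;
    }
    sketch_event_t ev = sketch_queue[sketch_head++];
    assert(ev.fn != NULL);
    ev.fn(ev.arg);
    return true;
}

/* Callback that records that the failure notification arrived. */
static void sketch_on_failure(void *arg)
{
    *(bool *)arg = true;
}

/* Wait for a flag by cycling the event loop. With no threads, nothing
 * else can deliver the callback. Returns false if the queue drains
 * before the flag is set. */
static bool sketch_wait_for(const bool *flag)
{
    while (!*flag) {
        if (!sketch_progress()) {
            return false;
        }
    }
    return true;
}
```

A caller would post `sketch_on_failure` with a pointer to its flag and then call `sketch_wait_for`; a test that instead spins on the flag without calling `sketch_progress` never sees the callback, which is the failure mode described above.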

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland

Josh, There were a couple of bugs that I cleared up in my most recent checkin, but I also needed to modify your test. The callback for the application layer errmgr actually occurs in the application layer. Your test was never giving up the thread to the ORTE application event loop to receive it

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Josh Hursey

So I finally got a chance to test the branch this morning. I cannot get it to work. Maybe I'm doing something wrong, missing some MCA parameter? - [jjhursey@smoky-login1 resilient-orte] hg summary parent: 2:c550cf6ed6a2 tip Newest version. Synced with trunk r24785. branch: defa

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-23 Thread Wesley Bland

Last reminder (I hope). RFC goes in at COB today. Wesley

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Josh Hursey

Sounds good. Thanks. On Sat, Jun 18, 2011 at 9:31 AM, Wesley Bland wrote: > That's fine. Let's say Thursday COB is now the timeout. > > On Jun 18, 2011 9:10 AM, "Joshua Hursey" wrote: >> Cool. Then can we hold off pushing this into the trunk for a couple days >> until I get a chance to test it?

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Wesley Bland

That's fine. Let's say Thursday COB is now the timeout. On Jun 18, 2011 9:10 AM, "Joshua Hursey" wrote: > Cool. Then can we hold off pushing this into the trunk for a couple days until I get a chance to test it? Monday COB does not give me much time since we just got the new patch on Friday COB (t

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-18 Thread Joshua Hursey

Cool. Then can we hold off pushing this into the trunk for a couple days until I get a chance to test it? Monday COB does not give me much time since we just got the new patch on Friday COB (the RFC gave us 2 weeks to review the original patch). Would waiting until next Thursday/Friday COB be to

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-17 Thread Wesley Bland

I believe that it does. I made quite a few changes in the last checkin though I didn't run your specific test this afternoon. I'll be able to try it later this evening but it should be easy to test now that it's synced with the trunk again. On Jun 17, 2011 5:32 PM, "Josh Hursey" wrote: > Does this

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-17 Thread Josh Hursey

Does this include a fix for the problem I reported with mpirun-hosted processes? If not I would ask that we hold off on putting it into the trunk until that particular bug is addressed. From my experience tackling this particular issue requires some code refactoring, which should probably be d

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-17 Thread Wesley Bland

This is a reminder that the Resilient ORTE RFC is set to go into the trunk on Monday at COB. I've updated the code with a few of the changes that were mentioned on and off the list (moved code out of orted_comm.c, errmgr_set_callback returns previous callback, post_startup function, corrected n

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

No issue - just trying to get ahead of the game instead of running into an issue later. We can leave it for now. On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote: > We could, but we could also just replace the callback. I will never > want to use it in my scenario, and if I did then I could just

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey

We could, but we could also just replace the callback. I will never want to use it in my scenario, and if I did then I could just call it directly instead of relying on the errmgr to do the right thing. So why complicate the errmgr with additional complexity for something that we don't need at the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

So why not have the callback return an int, and your callback returns "go no further"? On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote: > Yeah I do not want the default fatal callback in OMPI. I want to > replace it with something that allows OMPI to continue running when > there are process fai

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey

Yeah I do not want the default fatal callback in OMPI. I want to replace it with something that allows OMPI to continue running when there are process failures (if the error handlers associated with the communicators permit such an action). So having the default fatal callback called after mine wou

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: >> >> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: >> >>> >>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >>> Well, you're way too trusting. ;) >>> >>> It's the midwestern boy

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote: >> >> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: >> >>> Another problem with this patch, that I mentioned to Wesley and George >>> off list, is that it does not handle the case when mpi

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey

On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote: > > On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: > >> Another problem with this patch, that I mentioned to Wesley and George >> off list, is that it does not handle the case when mpirun/HNP is also >> hosting processes that might fail. In my

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: >> >> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: >> >>> >>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >>> Well, you're way too trusting. ;) >>> >>> It's the midwestern boy

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: > Another problem with this patch, that I mentioned to Wesley and George > off list, is that it does not handle the case when mpirun/HNP is also > hosting processes that might fail. In my testing of the patch it > worked fine if mpirun/HNP was -not-

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey

Another problem with this patch, that I mentioned to Wesley and George off list, is that it does not handle the case when mpirun/HNP is also hosting processes that might fail. In my testing of the patch it worked fine if mpirun/HNP was -not- hosting any processes, but once it had to host processes

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

Okay, finally have time to sit down and review this. It looks pretty much identical to what was done in ORCM - we just kept "epoch" separate from the process name, and use multicast to notify all procs that someone failed. I do have a few questions/comments about your proposed patch: 1. I note

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey

On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: > > On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: > >> >> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >> >>> Well, you're way too trusting. ;) >> >> It's the midwestern boy in me :) > > Still need to shake that corn out of your head... :-

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

Something else you might want to address in here: the current code sends an RML message from the proc calling abort to its local daemon telling the daemon that we are exiting due to the app calling "abort". We needed to do this because we wanted to flag the proc termination as one induced by the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: > > On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: > >> Well, you're way too trusting. ;) > > It's the midwestern boy in me :) Still need to shake that corn out of your head... :-) > >> >> This only works if all components play the game, and

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Joshua Hursey

On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: > Well, you're way too trusting. ;) It's the midwestern boy in me :) > > This only works if all components play the game, and even then it is > difficult if you want to allow components to deregister themselves in the > middle of the execut

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread George Bosilca

Well, you're way too trusting. ;) This only works if all components play the game, and even then it is difficult if you want to allow components to deregister themselves in the middle of the execution. The problem is that a callback will be previous for some component, and that when you want

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey

So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c: - orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback); - Which is a callback that just calls abort (which is what we want to do by default): - void ompi_errhandler_runtime_callbac
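The replace-and-return-previous idea discussed in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch itself: the `sketch_*` names and the `int failed_rank` signature are hypothetical, and the default callback only counts an abort request instead of calling abort so that the example stays testable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type; the real signature in the patch is not
 * shown in this thread. */
typedef void (*sketch_fault_cb_t)(int failed_rank);

static sketch_fault_cb_t sketch_current_cb = NULL;
static int sketch_abort_requests = 0;
static int sketch_survived = 0;

/* Install a new callback and hand back the one it replaces, so the
 * upper layer decides whether to drop it or call it directly. */
static sketch_fault_cb_t sketch_set_fault_callback(sketch_fault_cb_t cb)
{
    sketch_fault_cb_t prev = sketch_current_cb;
    sketch_current_cb = cb;
    return prev;
}

/* Default policy: any failure is fatal (counted here, not aborted). */
static void sketch_default_fatal_cb(int failed_rank)
{
    (void)failed_rank;
    sketch_abort_requests++;
}

/* Replacement policy: note the failure and keep running. */
static void sketch_run_through_cb(int failed_rank)
{
    (void)failed_rank;
    sketch_survived++;
}

/* Deliver a failure notification to whichever callback is installed. */
static void sketch_notify_failure(int failed_rank)
{
    assert(failed_rank >= 0);
    if (sketch_current_cb != NULL) {
        sketch_current_cb(failed_rank);
    }
}
```

Because only one callback is installed at a time, the error manager does not need to track registration order; a layer that wants the old behaviour can keep the returned pointer and invoke it itself.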

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Ralph Castain

You mean you want the abort API to point somewhere else, without using a new component? Perhaps a telecon would help resolve this quicker? I'm available tomorrow or anytime next week, if that helps. On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey wrote: > As long as there is the ability to remove

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey

As long as there is the ability to remove and replace a callback I'm fine. I personally think that forcing the errmgr to track ordering of callback registration makes it a more complex solution, but as long as it works. In particular I need to replace the default 'abort' errmgr call in OMPI with s

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Ralph Castain

I agree - let's not get overly complex unless we can clearly articulate a requirement to do so. On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca wrote: > This will require exactly opposite registration and de-registration order, > or no de-registration at all (aka no way to unload a component). Or

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread George Bosilca

This will require exactly opposite registration and de-registration order, or no de-registration at all (aka no way to unload a component). Or some even more complex code to deal with internally. If the error manager handles the callbacks it can use the registration ordering (which will be what

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey

On Wed, Jun 8, 2011 at 5:37 PM, Wesley Bland wrote: > On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote: > > - orte_errmgr.post_startup() starts the persistent RML message. There > does not seem to be a shutdown version of this (to deregister the RML > message at orte_finalize time). Was this

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-08 Thread George Bosilca

Well well well, that wasn't supposed to go on the mailing list ;) george On Jun 8, 2011, at 17:43 , George Bosilca wrote: > Hey if you want to push to the extreme the logic of the "computer scientist" > you were talking about in my office, then return the previous callback and > let the uppe

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-08 Thread George Bosilca

Hey if you want to push to the extreme the logic of the "computer scientist" you were talking about in my office, then return the previous callback and let the upper layer do the right thing. Suppose they don't screw up for once ... george. On Jun 8, 2011, at 17:37 , Wesley Bland wrote: > A

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-08 Thread Wesley Bland

On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote: - orte_errmgr.post_startup() starts the persistent RML message. There does not seem to be a shutdown version of this (to deregister the RML message at orte_finalize time). Was this intentional, or just missed? I just missed that one. I've ad

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain

Thanks - that helps! On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland wrote: > Definitely we are targeting ORTED failures here. If an ORTED fails then > any other ORTEDs connected to it will notice and report the failure. Of > course if the failure is an application then the ORTED on that node wil

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey

I looked through the patch a bit more today and had a few notes/questions. - orte_errmgr.post_startup() starts the persistent RML message. There does not seem to be a shutdown version of this (to deregister the RML message at orte_finalize time). Was this intentional, or just missed? - in the orte_e

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland

Definitely we are targeting ORTED failures here. If an ORTED fails then any other ORTEDs connected to it will notice and report the failure. Of course if the failure is an application then the ORTED on that node will be the only one to detect it. Also, if an ORTED is lost, all of the applicatio

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain

Quick question: could you please clarify this statement: ...because more than one ORTED could (and often will) detect the failure. > I don't understand how this can be true, except for detecting an ORTED failure. Only one orted can detect an MPI process failure, unless you have now involved orted

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain

Ah - thanks! That really helped clarify things. Much appreciated. Will look at the patch in this light... On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland wrote: > > Perhaps it would help if you folks could provide a little explanation about > how you use epoch? While the value sounds similar, your

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland

> > Perhaps it would help if you folks could provide a little explanation about > how you use epoch? While the value sounds similar, your explanations are > beginning to sound very different from what we are doing and/or had > envisioned. > > I'm not sure how you can talk about an epoch being

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain

On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote: > > On Jun 7, 2011, at 12:14 , Ralph Castain wrote: > > > But the epoch is process-unique - i.e., it is the number of times that > this specific process has been started, which differs per proc since we > don't restart all the procs every time

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain

On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland wrote: > > On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote: > > > > On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > > To address your concerns about putting the epoch in the process name > structure, putting it in there rather than i

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread George Bosilca

On Jun 7, 2011, at 12:14 , Ralph Castain wrote: > But the epoch is process-unique - i.e., it is the number of times that this > specific process has been started, which differs per proc since we don't > restart all the procs every time one fails. Yes the epoch is per process, but it is distrib

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland

On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote: > > > On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland (mailto:wbl...@eecs.utk.edu)> wrote: > > To address your concerns about putting the epoch in the process name > > structure, putting it in there rather than in a separately maintained

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain

On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > To address your concerns about putting the epoch in the process name > structure, putting it in there rather than in a separately maintained list > simplifies things later. > Not really concerned - I was just noting we had done it a tad diffe

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland

To address your concerns about putting the epoch in the process name structure, putting it in there rather than in a separately maintained list simplifies things later. For example, during communication you need to attach the epoch to each of your messages so they can be tracked later. If a pro

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain

Thanks for the explanation - as I said, I won't have time to really review the patch this week, but appreciate the info. I don't really expect to see a conflict as George had discussed this with me previously. I know I'll have merge conflicts with my state machine branch, which would be ready for

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey

I briefly looked over the patch. Excluding the epochs (which we don't need now, but will soon) it looks similar to what I have setup on my MPI run-through stabilization branch - so it should support that work nicely. I'll try to test it this week and send back any other comments. Good work. Thank

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland

This could certainly work alongside another ORCM or any other fault detection/prediction/recovery mechanism. Most of the code is just dedicated to keeping the epoch up to date and tracking the status of the processes. The underlying idea was to provide a way for the application to decide what it

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain

I'm on travel this week, but will look this over when I return. From the description, it sounds nearly identical to what we did in ORCM, so I expect there won't be many issues. You do get some race conditions that the new state machine code should help resolve. Only difference I can quickly see is

[OMPI devel] RFC: Resilient ORTE

2011-06-06 Thread George Bosilca

WHAT: Allow the runtime to handle fail-stop failures for both runtime (daemons) and application-level processes. This patch extends the orte_process_name_t structure with a field to store the process epoch (the number of times it died so far), and adds an application failure notification callback
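The epoch idea in the RFC can be sketched with a small stand-in for the extended process name. This is a hedged illustration only: `sketch_name_t` and its fields are hypothetical and do not reproduce the real `orte_process_name_t` layout. The duplicate-report check is also an assumption about how an epoch could be used when several daemons report the same failure; the thread does not confirm that the patch works this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a process name that carries an epoch. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    uint32_t epoch;   /* number of times this process has died so far */
} sketch_name_t;

#define SKETCH_NPROCS 4

/* Current epoch known for each vpid; zero-initialized. */
static uint32_t sketch_epochs[SKETCH_NPROCS];

/* Record the failure of the incarnation named by `failed`.
 * Returns true for a new failure and false for a duplicate report
 * of an incarnation whose failure was already recorded. */
static bool sketch_record_failure(const sketch_name_t *failed)
{
    assert(failed->vpid < SKETCH_NPROCS);
    if (failed->epoch != sketch_epochs[failed->vpid]) {
        return false;
    }
    sketch_epochs[failed->vpid]++;
    return true;
}

/* A message is stale when the sender name attached to it carries an
 * epoch older than the one currently recorded for that process. */
static bool sketch_is_stale(const sketch_name_t *sender)
{
    assert(sender->vpid < SKETCH_NPROCS);
    return sender->epoch < sketch_epochs[sender->vpid];
}
```

Keeping the epoch inside the name means every message that already carries a sender name also carries the epoch, so a check like `sketch_is_stale` needs no separate lookup on the sending side.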

55 matches
