Committed in r24815.
On Thursday, June 23, 2011 at 4:19 PM, Ralph Castain wrote:
>
> On Jun 23, 2011, at 2:14 PM, Wesley Bland wrote:
> > Maybe before the ORTED saw the signal, it detected a communication failure
> > and reacted to that.
>
> Quite possible. However, remember that procs local
Gah - what a rookie mistake :)
I tested the patched test and it works as advertised for the small
scale tests I used before. So I'm good with this going in today.
Thanks,
Josh
On Thu, Jun 23, 2011 at 3:34 PM, Wesley Bland wrote:
> Right. Sorry I misspoke.
>
> On Thursday, June 23, 2011 at 3:32 P
Right. Sorry I misspoke.
On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote:
> Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem
> of "not giving up the thread". The problem was that Josh's test never called
> progress. It would have been equally okay to simpl
Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem of
"not giving up the thread". The problem was that Josh's test never called
progress. It would have been equally okay to simply call "opal_event_dispatch"
while waiting for the callback.
All applications have to cycle
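A minimal sketch of the pattern being described: instead of blocking on a bare flag, the test drives the OPAL progress engine while waiting for the errmgr callback to fire. opal_progress() is the real entry point; the callback and flag names below are illustrative only, not the test's actual code.
-
#include "opal/runtime/opal_progress.h"

static volatile int fault_seen = 0;

/* illustrative application-level errmgr callback */
static void test_fault_callback(void)
{
    fault_seen = 1;
}

/* wait for the callback while giving the event loop a chance to run */
static void wait_for_fault(void)
{
    while (!fault_seen) {
        opal_progress();
    }
}
-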
Josh,
There were a couple of bugs that I cleared up in my most recent checkin, but I
also needed to modify your test. The callback for the application layer errmgr
actually occurs in the application layer. Your test was never giving up the
thread to the ORTE application event loop to receive it
So I finally got a chance to test the branch this morning. I cannot
get it to work. Maybe I'm doing something wrong, missing some MCA
parameter?
-
[jjhursey@smoky-login1 resilient-orte] hg summary
parent: 2:c550cf6ed6a2 tip
Newest version. Synced with trunk r24785.
branch: defa
Last reminder (I hope). RFC goes in at COB today.
Wesley
Sounds good. Thanks.
On Sat, Jun 18, 2011 at 9:31 AM, Wesley Bland wrote:
> That's fine. Let's say Thursday COB is now the timeout.
>
> On Jun 18, 2011 9:10 AM, "Joshua Hursey" wrote:
>> Cool. Then can we hold off pushing this into the trunk for a couple days
>> until I get a chance to test it?
That's fine. Let's say Thursday COB is now the timeout.
On Jun 18, 2011 9:10 AM, "Joshua Hursey" wrote:
> Cool. Then can we hold off pushing this into the trunk for a couple days
until I get a chance to test it? Monday COB does not give me much time since
we just got the new patch on Friday COB (t
Cool. Then can we hold off pushing this into the trunk for a couple days until
I get a chance to test it? Monday COB does not give me much time since we just
got the new patch on Friday COB (the RFC gave us 2 weeks to review the original
patch). Would waiting until next Thursday/Friday COB be to
I believe that it does. I made quite a few changes in the last checkin
though I didn't run your specific test this afternoon. I'll be able to try
it later this evening but it should be easy to test now that it's synced
with the trunk again.
On Jun 17, 2011 5:32 PM, "Josh Hursey" wrote:
> Does this
Does this include a fix for the problem I reported with mpirun-hosted processes?
If not, I would ask that we hold off on putting it into the trunk
until that particular bug is addressed. From my experience, tackling
this particular issue requires some code refactoring, which should
probably be d
This is a reminder that the Resilient ORTE RFC is set to go into the trunk on
Monday at COB.
I've updated the code with a few of the changes that were mentioned on and off
the list (moved code out of orted_comm.c, errmgr_set_callback returns previous
callback, post_startup function, corrected n
No issue - just trying to get ahead of the game instead of running into an
issue later.
We can leave it for now.
On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote:
> We could, but we could also just replace the callback. I will never
> want to use it in my scenario, and if I did then I could just
We could, but we could also just replace the callback. I will never
want to use it in my scenario, and if I did then I could just call it
directly instead of relying on the errmgr to do the right thing. So
why complicate the errmgr for something
that we don't need at the
So why not have the callback return an int, and your callback returns "go no
further"?
On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
> Yeah I do not want the default fatal callback in OMPI. I want to
> replace it with something that allows OMPI to continue running when
> there are process fai
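A sketch of the "return an int" idea suggested above, under the assumption that the errmgr keeps a stack of callbacks and walks it newest-first. All names and constants here are illustrative, and this is not what the patch ultimately does (it returns the previous callback instead).
-
#define ERRMGR_CALLBACK_CONTINUE      0  /* keep walking the older callbacks */
#define ERRMGR_CALLBACK_GO_NO_FURTHER 1  /* stop the chain here              */

typedef int (*fault_cb_fn_t)(int errcode);

#define MAX_FAULT_CBS 8
static fault_cb_fn_t cb_stack[MAX_FAULT_CBS];
static int cb_count = 0;

static void register_fault_cb(fault_cb_fn_t cb)
{
    if (cb_count < MAX_FAULT_CBS) {
        cb_stack[cb_count++] = cb;
    }
}

/* newest callback runs first; OMPI's handler would return
 * GO_NO_FURTHER so the default abort callback never fires */
static void report_fault(int errcode)
{
    for (int i = cb_count - 1; i >= 0; i--) {
        if (ERRMGR_CALLBACK_GO_NO_FURTHER == cb_stack[i](errcode)) {
            break;
        }
    }
}
-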
Yeah I do not want the default fatal callback in OMPI. I want to
replace it with something that allows OMPI to continue running when
there are process failures (if the error handlers associated with the
communicators permit such an action). So having the default fatal
callback called after mine wou
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote:
>>
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>
>>>
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>
>>>> Well, you're way too trusting. ;)
>>>
>>> It's the midwestern boy
On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote:
> On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote:
>>
>> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>>
>>> Another problem with this patch, that I mentioned to Wesley and George
>>> off list, is that it does not handle the case when mpi
On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote:
>
> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>
>> Another problem with this patch, that I mentioned to Wesley and George
>> off list, is that it does not handle the case when mpirun/HNP is also
>> hosting processes that might fail. In my
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote:
>>
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>
>>>
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>
>>>> Well, you're way too trusting. ;)
>>>
>>> It's the midwestern boy
On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
> Another problem with this patch, that I mentioned to Wesley and George
> off list, is that it does not handle the case when mpirun/HNP is also
> hosting processes that might fail. In my testing of the patch it
> worked fine if mpirun/HNP was -not-
Another problem with this patch, that I mentioned to Wesley and George
off list, is that it does not handle the case when mpirun/HNP is also
hosting processes that might fail. In my testing of the patch it
worked fine if mpirun/HNP was -not- hosting any processes, but once it
had to host processes
Okay, finally have time to sit down and review this. It looks pretty much
identical to what was done in ORCM - we just kept "epoch" separate from the
process name, and use multicast to notify all procs that someone failed. I do
have a few questions/comments about your proposed patch:
1. I note
On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote:
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusting. ;)
>>
>> It's the midwestern boy in me :)
>
> Still need to shake that corn out of your head... :-
Something else you might want to address in here: the current code sends an RML
message from the proc calling abort to its local daemon telling the daemon that
we are exiting due to the app calling "abort". We needed to do this because we
wanted to flag the proc termination as one induced by the
On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>
>> Well, you're way too trusting. ;)
>
> It's the midwestern boy in me :)
Still need to shake that corn out of your head... :-)
>
>>
>> This only works if all components play the game, and
On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> Well, you're way too trusting. ;)
It's the midwestern boy in me :)
>
> This only works if all components play the game, and even then it is
> difficult if you want to allow components to deregister themselves in the
> middle of the execut
Well, you're way too trusting. ;)
This only works if all components play the game, and even then it is
difficult if you want to allow components to deregister themselves in the
middle of the execution. The problem is that a callback will be the 'previous'
one for some component, and that when you want
So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
-
orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
-
Which is a callback that just calls abort (which is what we want to do
by default):
-
void ompi_errhandler_runtime_callbac
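A sketch of the replacement being asked for, assuming the updated API discussed later in this thread where set_fault_callback() returns the previously registered callback. The callback signature and the stabilization handler name are assumptions for illustration, not the patch's exact types.
-
#include "orte/mca/errmgr/errmgr.h"

/* assumed callback signature for this sketch */
typedef void (*fault_callback_fn_t)(orte_process_name_t proc);

static fault_callback_fn_t prev_fault_cb = NULL;

/* illustrative OMPI-level handler: mark the proc as failed and let the
 * communicator error handlers decide, instead of aborting */
static void ompi_stabilization_callback(orte_process_name_t proc)
{
    /* run-through stabilization logic would go here */
}

static void install_fault_handler(void)
{
    /* capture the default (abort) callback in case it is ever wanted back */
    prev_fault_cb =
        orte_errmgr.set_fault_callback(ompi_stabilization_callback);
}
-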
You mean you want the abort API to point somewhere else, without using a new
component?
Perhaps a telecon would help resolve this quicker? I'm available tomorrow or
anytime next week, if that helps.
On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey wrote:
> As long as there is the ability to remove
As long as there is the ability to remove and replace a callback I'm
fine. I personally think that forcing the errmgr to track ordering of
callback registration makes it a more complex solution, but as long as
it works.
In particular I need to replace the default 'abort' errmgr call in
OMPI with s
I agree - let's not get overly complex unless we can clearly articulate a
requirement to do so.
On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca wrote:
> This will require exactly opposite registration and de-registration order,
> or no de-registration at all (aka no way to unload a component). Or
This will require exactly opposite registration and de-registration order, or
no de-registration at all (aka no way to unload a component). Or some even more
complex code to deal with internally.
If the error manager handles the callbacks, it can use the registration ordering
(which will be what
On Wed, Jun 8, 2011 at 5:37 PM, Wesley Bland wrote:
> On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:
>
> - orte_errmgr.post_startup() starts the persistent RML message. There
> does not seem to be a shutdown version of this (to deregister the RML
> message at orte_finalize time). Was this
Well well well, that wasn't supposed to go on the mailing list ;)
george
On Jun 8, 2011, at 17:43 , George Bosilca wrote:
> Hey if you want to push to the extreme the logic of the "computer scientist"
> you were talking about in my office, then return the previous callback and
> let the uppe
Hey if you want to push to the extreme the logic of the "computer scientist"
you were talking about in my office, then return the previous callback and let
the upper layer do the right thing. Suppose they don't screw up for once ...
george.
On Jun 8, 2011, at 17:37 , Wesley Bland wrote:
> A
On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:
- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?
I just missed that one. I've ad
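For reference, a sketch of the symmetry being asked for here, assuming the persistent receive is posted through the RML. The tag name below is illustrative and the signatures are an approximation of the ORTE RML interfaces of that era, not the patch's actual code.
-
#include "orte/mca/rml/rml.h"

/* handler for incoming failure notifications (body elided) */
static void failure_notice_recv(int status, orte_process_name_t *sender,
                                opal_buffer_t *buffer, orte_rml_tag_t tag,
                                void *cbdata)
{
    /* update epochs, invoke the registered fault callback, ... */
}

int errmgr_post_startup(void)
{
    /* persistent receive posted at startup */
    return orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
                                   ORTE_RML_TAG_FAILURE_NOTICE, /* illustrative tag */
                                   ORTE_RML_PERSISTENT,
                                   failure_notice_recv, NULL);
}

int errmgr_post_shutdown(void)
{
    /* the missing counterpart: cancel the receive at orte_finalize time */
    return orte_rml.recv_cancel(ORTE_NAME_WILDCARD,
                                ORTE_RML_TAG_FAILURE_NOTICE);
}
-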
Thanks - that helps!
On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland wrote:
> Definitely we are targeting ORTED failures here. If an ORTED fails then
> any other ORTEDs connected to it will notice and report the failure. Of
> course if the failure is an application process then the ORTED on that node wil
I looked through the patch a bit more today and had a few notes/questions.
- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?
- in the orte_e
Definitely we are targeting ORTED failures here. If an ORTED fails then any
other ORTEDs connected to it will notice and report the failure. Of course if
the failure is an application process then the ORTED on that node will be the only one
to detect it.
Also, if an ORTED is lost, all of the applicatio
Quick question: could you please clarify this statement:
...because more than one ORTED could (and often will) detect the failure.
>
I don't understand how this can be true, except for detecting an ORTED
failure. Only one orted can detect an MPI process failure, unless you have
now involved orted
Ah - thanks! That really helped clarify things. Much appreciated.
Will look at the patch in this light...
On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland wrote:
>
> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your
>
> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your explanations are
> beginning to sound very different from what we are doing and/or had
> envisioned.
>
> I'm not sure how you can talk about an epoch being
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote:
>
> On Jun 7, 2011, at 12:14 , Ralph Castain wrote:
>
> > But the epoch is process-unique - i.e., it is the number of times that
> this specific process has been started, which differs per proc since we
> don't restart all the procs every time
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland wrote:
>
> On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
>
> To address your concerns about putting the epoch in the process name
> structure, putting it in there rather than i
On Jun 7, 2011, at 12:14 , Ralph Castain wrote:
> But the epoch is process-unique - i.e., it is the number of times that this
> specific process has been started, which differs per proc since we don't
> restart all the procs every time one fails.
Yes the epoch is per process, but it is distrib
On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
> > To address your concerns about putting the epoch in the process name
> > structure, putting it in there rather than in a separately maintained
On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
> To address your concerns about putting the epoch in the process name
> structure, putting it in there rather than in a separately maintained list
> simplifies things later.
>
Not really concerned - I was just noting we had done it a tad diffe
To address your concerns about putting the epoch in the process name structure,
putting it in there rather than in a separately maintained list simplifies
things later.
For example, during communication you need to attach the epoch to each of your
messages so they can be tracked later. If a pro
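A sketch of that idea: with the epoch carried inside the process name, every message already says which incarnation of the peer sent it, so stale traffic can be recognized on receipt. The struct below mirrors ORTE's jobid/vpid layout but is an approximation, not the patch's actual definition.
-
#include <stdbool.h>
#include <stdint.h>
#include "orte/types.h"   /* orte_jobid_t, orte_vpid_t */

typedef uint32_t epoch_t;   /* assumed width */

typedef struct {
    orte_jobid_t jobid;
    orte_vpid_t  vpid;
    epoch_t      epoch;   /* times this proc has been (re)started */
} resilient_process_name_t;

/* a message stamped with an older epoch than the one currently known
 * for the sender can be treated as coming from a dead incarnation */
static bool message_is_stale(const resilient_process_name_t *sender,
                             epoch_t known_epoch)
{
    return sender->epoch < known_epoch;
}
-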
Thanks for the explanation - as I said, I won't have time to really review
the patch this week, but appreciate the info. I don't really expect to see a
conflict as George had discussed this with me previously.
I know I'll have merge conflicts with my state machine branch, which would
be ready for
I briefly looked over the patch. Excluding the epochs (which we don't
need now, but will soon) it looks similar to what I have setup on my
MPI run-through stabilization branch - so it should support that work
nicely. I'll try to test it this week and send back any other
comments.
Good work.
Thank
This could certainly work alongside another ORCM or any other fault
detection/prediction/recovery mechanism. Most of the code is just dedicated to
keeping the epoch up to date and tracking the status of the processes. The
underlying idea was to provide a way for the application to decide what it
I'm on travel this week, but will look this over when I return. From the
description, it sounds nearly identical to what we did in ORCM, so I expect
there won't be many issues. You do get some race conditions that the new
state machine code should help resolve.
Only difference I can quickly see is