Committed in r24815.
On Thursday, June 23, 2011 at 4:19 PM, Ralph Castain wrote:
>
> On Jun 23, 2011, at 2:14 PM, Wesley Bland wrote:
> > Maybe before the ORTED saw the signal, it detected a communication failure
> > and reacted to that.
>
> Quite possible. However, remember that procs local
Gah - what a rookie mistake :)
I tested the patched test and it works as advertised for the small
scale tests I used before. So I'm good with this going in today.
Thanks,
Josh
On Thu, Jun 23, 2011 at 3:34 PM, Wesley Bland wrote:
> Right. Sorry I misspoke.
>
> On Thursday,
Right. Sorry I misspoke.
On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote:
> Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem
> of "not giving up the thread". The problem was that Josh's test never called
> progress. It would have been equally okay to
Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem of
"not giving up the thread". The problem was that Josh's test never called
progress. It would have been equally okay to simply call "opal_event_dispatch"
while waiting for the callback.
All applications have to
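The pattern Ralph describes above - driving the event loop (e.g. via opal_event_dispatch) while waiting for the callback, instead of blocking without ever yielding to event processing - can be sketched with a self-contained stand-in. Everything below except the pattern itself (function names, the event count) is hypothetical; the real code would call opal_event_dispatch().

```c
#include <stdbool.h>

/* Hypothetical stand-in for ORTE's event loop: in real code the loop
 * body would be opal_event_dispatch(), which drains pending events and
 * fires registered callbacks. */
static bool callback_fired = false;
static int pending_events = 3;

static void fault_callback(void) {
    callback_fired = true;      /* the errmgr callback finally runs */
}

static void dispatch_one_event(void) {
    if (pending_events > 0 && --pending_events == 0) {
        fault_callback();       /* last event delivers the notification */
    }
}

/* The fix to Josh's test: keep driving the event loop while waiting,
 * rather than spinning without ever calling progress. */
static void wait_for_callback(void) {
    while (!callback_fired) {
        dispatch_one_event();
    }
}
```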
Josh,
There were a couple of bugs that I cleared up in my most recent checkin, but I
also needed to modify your test. The callback for the application layer errmgr
actually occurs in the application layer. Your test was never giving up the
thread to the ORTE application event loop to receive
So I finally got a chance to test the branch this morning. I cannot
get it to work. Maybe I'm doing something wrong, missing some MCA
parameter?
    [jjhursey@smoky-login1 resilient-orte] hg summary
    parent: 2:c550cf6ed6a2 tip
    Newest version. Synced with trunk r24785.
    branch:
Last reminder (I hope). RFC goes in at COB today.
Wesley
Sounds good. Thanks.
On Sat, Jun 18, 2011 at 9:31 AM, Wesley Bland wrote:
> That's fine. Let's say Thursday COB is now the timeout.
>
> On Jun 18, 2011 9:10 AM, "Joshua Hursey" wrote:
>> Cool. Then can we hold off pushing this into the trunk for a
That's fine. Let's say Thursday COB is now the timeout.
On Jun 18, 2011 9:10 AM, "Joshua Hursey" wrote:
> Cool. Then can we hold off pushing this into the trunk for a couple days
until I get a chance to test it? Monday COB does not give me much time since
we just got the
Cool. Then can we hold off pushing this into the trunk for a couple days until
I get a chance to test it? Monday COB does not give me much time since we just
got the new patch on Friday COB (the RFC gave us 2 weeks to review the original
patch). Would waiting until next Thursday/Friday COB be
I believe that it does. I made quite a few changes in the last checkin
though I didn't run your specific test this afternoon. I'll be able to try
it later this evening but it should be easy to test now that it's synced
with the trunk again.
On Jun 17, 2011 5:32 PM, "Josh Hursey"
Does this include a fix for the problem I reported with mpirun-hosted processes?
If not, I would ask that we hold off on putting it into the trunk
until that particular bug is addressed. From my experience tackling
this particular issue requires some code refactoring, which should
probably be
No issue - just trying to get ahead of the game instead of running into an
issue later.
We can leave it for now.
On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote:
> We could, but we could also just replace the callback. I will never
> want to use it in my scenario, and if I did then I could just
We could, but we could also just replace the callback. I will never
want to use it in my scenario, and if I did then I could just call it
directly instead of relying on the errmgr to do the right thing. So
why burden the errmgr with additional complexity for something
that we don't need at the
So why not have the callback return an int, and have your callback return "go no
further"?
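Ralph's suggestion - a callback chain where a return value of "go no further" suppresses the later (default abort) handler - might look like the following sketch. All names and return codes here are made up for illustration; they are not the actual errmgr API.

```c
/* Hypothetical return codes: CONTINUE lets the next callback in the
 * chain run; GO_NO_FURTHER stops the chain (e.g. suppressing the
 * default abort handler). */
enum { ERRMGR_CONTINUE = 0, ERRMGR_GO_NO_FURTHER = 1 };

typedef int (*fault_cb_t)(int errcode);

static int calls_made = 0;

/* OMPI's replacement callback: decide to keep running. */
static int app_callback(int errcode) {
    (void)errcode;
    calls_made++;
    return ERRMGR_GO_NO_FURTHER;
}

/* The default fatal callback: never reached if a prior callback
 * vetoes further handling. */
static int default_abort_callback(int errcode) {
    (void)errcode;
    calls_made++;
    return ERRMGR_CONTINUE;
}

/* The errmgr walks the chain until a callback vetoes further handling. */
static void report_fault(fault_cb_t *chain, int n, int errcode) {
    for (int i = 0; i < n; i++) {
        if (chain[i](errcode) == ERRMGR_GO_NO_FURTHER) {
            break;
        }
    }
}
```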
On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
> Yeah I do not want the default fatal callback in OMPI. I want to
> replace it with something that allows OMPI to continue running when
> there are process
Yeah I do not want the default fatal callback in OMPI. I want to
replace it with something that allows OMPI to continue running when
there are process failures (if the error handlers associated with the
communicators permit such an action). So having the default fatal
callback called after mine
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote:
>>
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>
>>>
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>
Well, you're way too trusting. ;)
>>>
>>>
On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote:
> On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote:
>>
>> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>>
>>> Another problem with this patch, that I mentioned to Wesley and George
>>> off list, is that it does not
On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote:
>
> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>
>> Another problem with this patch, that I mentioned to Wesley and George
>> off list, is that it does not handle the case when mpirun/HNP is also
>> hosting processes
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote:
>>
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>
>>>
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>
Well, you're way too trusting. ;)
>>>
>>>
On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
> Another problem with this patch, that I mentioned to Wesley and George
> off list, is that it does not handle the case when mpirun/HNP is also
> hosting processes that might fail. In my testing of the patch it
> worked fine if mpirun/HNP was
Another problem with this patch, that I mentioned to Wesley and George
off list, is that it does not handle the case when mpirun/HNP is also
hosting processes that might fail. In my testing of the patch it
worked fine if mpirun/HNP was -not- hosting any processes, but once it
had to host processes
Okay, finally have time to sit down and review this. It looks pretty much
identical to what was done in ORCM - we just kept "epoch" separate from the
process name, and use multicast to notify all procs that someone failed. I do
have a few questions/comments about your proposed patch:
1. I note
On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote:
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusting. ;)
>>
>> It's the midwestern boy in me :)
>
> Still need to shake that corn
Something else you might want to address in here: the current code sends an RML
message from the proc calling abort to its local daemon telling the daemon that
we are exiting due to the app calling "abort". We needed to do this because we
wanted to flag the proc termination as one induced by
On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>
>> Well, you're way too trusting. ;)
>
> It's the midwestern boy in me :)
Still need to shake that corn out of your head... :-)
>
>>
>> This only works if all components play the game, and
On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> Well, you're way too trusting. ;)
It's the midwestern boy in me :)
>
> This only works if all components play the game, and even then it is
> difficult if you want to allow components to deregister themselves in the
> middle of the
So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:

    orte_errmgr.set_fault_callback(_errhandler_runtime_callback);

Which is a callback that just calls abort (which is what we want to do
by default):

    void
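The snippet above is cut off at "void"; a minimal, self-contained sketch of the registration plus a default abort-style callback might look like this. The types and the callback body are assumptions (the preview is truncated), and the actual abort is replaced with a testable flag.

```c
/* Hypothetical callback signature; the real ORTE type differs. */
typedef void (*orte_errmgr_fault_cb_t)(int num_failed_procs);

static orte_errmgr_fault_cb_t registered_cb = 0;

/* Stand-in for orte_errmgr.set_fault_callback(). */
static void set_fault_callback(orte_errmgr_fault_cb_t cb) {
    registered_cb = cb;
}

static int abort_requested = 0;

/* Stand-in for _errhandler_runtime_callback: the default behavior is
 * simply to abort; here we set a flag instead of calling abort(). */
static void errhandler_runtime_callback(int num_failed_procs) {
    (void)num_failed_procs;
    abort_requested = 1;
}
```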
You mean you want the abort API to point somewhere else, without using a new
component?
Perhaps a telecon would help resolve this quicker? I'm available tomorrow or
anytime next week, if that helps.
On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey wrote:
> As long as there
As long as there is the ability to remove and replace a callback I'm
fine. I personally think that forcing the errmgr to track ordering of
callback registration makes it a more complex solution, but as long as
it works.
In particular I need to replace the default 'abort' errmgr call in
OMPI with
I agree - let's not get overly complex unless we can clearly articulate a
requirement to do so.
On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca wrote:
> This will require exactly opposite registration and de-registration order,
> or no de-registration at all (aka no way to
This will require exactly opposite registration and de-registration order, or
no de-registration at all (aka no way to unload a component). Or some even more
complex code to deal with internally.
If the error manager handles the callbacks it can use the registration ordering
(which will be what
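George's constraint - that a simple callback chain forces deregistration in exactly the reverse order of registration (or no deregistration at all) - can be illustrated with a small stack. All names here are hypothetical.

```c
#define MAX_CALLBACKS 8

typedef void (*cb_t)(void);

static cb_t cb_stack[MAX_CALLBACKS];
static int cb_top = 0;

static int register_cb(cb_t cb) {
    if (cb_top == MAX_CALLBACKS) return -1;
    cb_stack[cb_top++] = cb;
    return 0;
}

/* Only the most recently registered callback may be removed; a
 * component buried in the middle of the chain cannot deregister,
 * which is exactly the problem George points out. */
static int deregister_cb(cb_t cb) {
    if (cb_top == 0 || cb_stack[cb_top - 1] != cb) {
        return -1;  /* would break the chain */
    }
    cb_top--;
    return 0;
}

static void cb_a(void) {}
static void cb_b(void) {}
```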
On Wed, Jun 8, 2011 at 5:37 PM, Wesley Bland wrote:
> On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:
>
> - orte_errmgr.post_startup() starts the persistent RML message. There
> does not seem to be a shutdown version of this (to deregister the RML
> message at
Well well well, that wasn't supposed to go on the mailing list ;)
george
On Jun 8, 2011, at 17:43 , George Bosilca wrote:
> Hey if you want to push to the extreme the logic of the "computer scientist"
> you were talking about in my office, then return the previous callback and
> let the
On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:
- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?
I just missed that one. I've
Thanks - that helps!
On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland wrote:
> Definitely we are targeting ORTED failures here. If an ORTED fails then
> any other ORTEDs connected to it will notice and report the failure. Of
> course if the failure is an application then the
I looked through the patch a bit more today and had a few notes/questions.
- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?
- in the
Definitely we are targeting ORTED failures here. If an ORTED fails then any
other ORTEDs connected to it will notice and report the failure. Of course if
the failure is an application then the ORTED on that node will be the only one
to detect it.
Also, if an ORTED is lost, all of the
Quick question: could you please clarify this statement:
...because more than one ORTED could (and often will) detect the failure.
>
I don't understand how this can be true, except for detecting an ORTED
failure. Only one orted can detect an MPI process failure, unless you have
now involved
Ah - thanks! That really helped clarify things. Much appreciated.
Will look at the patch in this light...
On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland wrote:
>
> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the
>
> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your explanations are
> beginning to sound very different from what we are doing and/or had
> envisioned.
>
> I'm not sure how you can talk about an epoch
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote:
>
> On Jun 7, 2011, at 12:14 , Ralph Castain wrote:
>
> > But the epoch is process-unique - i.e., it is the number of times that
> this specific process has been started, which differs per proc since we
> don't restart
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland wrote:
>
> On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
>
> To address your concerns about putting the epoch in the process name
>
On Jun 7, 2011, at 12:14 , Ralph Castain wrote:
> But the epoch is process-unique - i.e., it is the number of times that this
> specific process has been started, which differs per proc since we don't
> restart all the procs every time one fails.
Yes the epoch is per process, but it is
On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
> > To address your concerns about putting the epoch in the process name
> > structure, putting it in there rather than in
To address your concerns about putting the epoch in the process name structure,
putting it in there rather than in a separately maintained list simplifies
things later.
For example, during communication you need to attach the epoch to each of your
messages so they can be tracked later. If a
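Wesley's argument above - that carrying the epoch inside the process name means every message automatically identifies which incarnation of a peer sent it - can be sketched as follows. The field names are guessed from the description, not the actual orte_process_name_t layout.

```c
/* Sketch of an epoch-extended process name; the real
 * orte_process_name_t layout may differ. */
typedef struct {
    unsigned int jobid;
    unsigned int vpid;
    unsigned int epoch;   /* times this proc has been (re)started */
} proc_name_t;

/* Because the name travels with every message, a receiver can
 * discard traffic from a dead incarnation without consulting a
 * separately maintained epoch list. */
static int msg_is_stale(const proc_name_t *sender,
                        const proc_name_t *known) {
    return sender->jobid == known->jobid &&
           sender->vpid  == known->vpid  &&
           sender->epoch <  known->epoch;
}
```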
Thanks for the explanation - as I said, I won't have time to really review
the patch this week, but appreciate the info. I don't really expect to see a
conflict as George had discussed this with me previously.
I know I'll have merge conflicts with my state machine branch, which would
be ready for
I'm on travel this week, but will look this over when I return. From the
description, it sounds nearly identical to what we did in ORCM, so I expect
there won't be many issues. You do get some race conditions that the new
state machine code should help resolve.
Only difference I can quickly see
WHAT: Allow the runtime to handle fail-stop failures for both runtime (daemon)
and application-level processes. This patch extends the orte_process_name_t
structure with a field to store the process epoch (the number of times it has
died so far), and adds an application failure notification callback