On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> Well, you're way too trusting. ;)
It's the midwestern boy in me :)
>
> This only works if all components play the game, and even then it is
> difficult if you want to allow components to deregister themselves in the
> middle of the execut
Ah I see what you are getting at now.
The construction of the list of connected processes is something I
intentionally did not modify from the current Open MPI code. The list is
calculated based on the locally known set of local and remote process groups
attached to the communicator. So this
Well, you're way too trusting. ;)
This only works if all components play the game, and even then it is
difficult if you want to allow components to deregister themselves in the
middle of the execution. The problem is that a callback will be previous for
some component, and that when you want
What I'm saying is that there is no reason to have any other type of MPI_Abort
if we are not able to compute the set of connected processes.
With this RFC, the processes in the communicator passed to MPI_Abort will abort. Then
the other processes in the same MPI_COMM_WORLD (in fact jobid) will be notif
On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca wrote:
> If this changes the behavior of MPI_Abort to only abort processes on the
> specified communicator, how does this not affect the default user experience
> (when today it aborts everything)?
Open MPI does abort everything by default - decided
If this changes the behavior of MPI_Abort to only abort processes on the
specified communicator, how does this not affect the default user experience
(when today it aborts everything)?
If we accept the fact that MPI_Abort will only abort the processes in the
current communicator what happens with
WHAT: Fix missing code in MPI_Abort
WHY: MPI_Abort is missing logic to ask for termination of the process
group defined by the communicator
WHERE: Mostly orte/mca/errmgr
WHEN: Open MPI trunk
TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
Details:
---
A
So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
-
orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
-
Which is a callback that just calls abort (which is what we want to do
by default):
-
void ompi_errhandler_runtime_callbac
You mean you want the abort API to point somewhere else, without using a new
component?
Perhaps a telecon would help resolve this quicker? I'm available tomorrow or
anytime next week, if that helps.
On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey wrote:
> As long as there is the ability to remove
As long as there is the ability to remove and replace a callback I'm
fine. I personally think that forcing the errmgr to track ordering of
callback registration makes it a more complex solution, but as long as
it works.
In particular I need to replace the default 'abort' errmgr call in
OMPI with s
I agree - let's not get overly complex unless we can clearly articulate a
requirement to do so.
On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca wrote:
> This will require exactly opposite registration and de-registration order,
> or no de-registration at all (aka no way to unload a component). Or
This will require exactly opposite registration and de-registration order, or
no de-registration at all (aka no way to unload a component). Or some even more
complex code to deal with internally.
If the error manager handles the callbacks it can use the registration ordering
(which will be what
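
The ordering concern in this message can be illustrated with a minimal sketch in which the error manager itself owns the list of callbacks, so a component can deregister in any order without having to hold a "previous" pointer. This is not the ORTE errmgr API: the names fault_cb_t, register_cb, deregister_cb, dispatch_fault and the three sample callbacks are all hypothetical, and the "abort" callback only records that it ran instead of terminating the process.

```c
#include <stddef.h>

/* Hypothetical callback type (not from ORTE): returns 1 if the fault was
 * handled, 0 to let an older callback try. */
typedef int (*fault_cb_t)(int failed_rank);

#define MAX_CALLBACKS 8

/* The manager owns the ordering; the newest registration sits at the top. */
fault_cb_t cb_stack[MAX_CALLBACKS];
size_t cb_count = 0;

/* Push a callback. Returns 0 on success, -1 if the table is full. */
int register_cb(fault_cb_t cb)
{
    if (cb_count == MAX_CALLBACKS) {
        return -1;
    }
    cb_stack[cb_count++] = cb;
    return 0;
}

/* Remove a callback from any position, keeping the relative order of the
 * rest. Returns 0 on success, -1 if the callback was not registered. */
int deregister_cb(fault_cb_t cb)
{
    for (size_t i = 0; i < cb_count; i++) {
        if (cb_stack[i] == cb) {
            for (size_t j = i + 1; j < cb_count; j++) {
                cb_stack[j - 1] = cb_stack[j];
            }
            cb_count--;
            return 0;
        }
    }
    return -1;
}

/* Call callbacks from newest to oldest until one reports it handled the
 * fault. Returns 1 if some callback handled it, 0 otherwise. */
int dispatch_fault(int failed_rank)
{
    for (size_t i = cb_count; i > 0; i--) {
        if (cb_stack[i - 1](failed_rank)) {
            return 1;
        }
    }
    return 0;
}

/* Sample callbacks; last_handler records which one ran last. */
int last_handler = 0;

/* Stand-in for a default "abort" handler; a real one would terminate. */
int default_abort_cb(int failed_rank)
{
    (void)failed_rank;
    last_handler = 1;
    return 1;
}

/* Stand-in for a component that replaces the default behavior. */
int recovery_cb(int failed_rank)
{
    (void)failed_rank;
    last_handler = 2;
    return 1;
}

/* Observes the fault but does not handle it, so dispatch falls through. */
int logging_cb(int failed_rank)
{
    (void)failed_rank;
    last_handler = 3;
    return 0;
}
```

With this layout, registering default_abort_cb and then recovery_cb makes recovery_cb run first, and removing default_abort_cb out of order leaves recovery_cb in place. The cost is the search and shift in deregister_cb, which is the extra internal bookkeeping being debated in the thread.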
On Wed, Jun 8, 2011 at 5:37 PM, Wesley Bland wrote:
> On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:
>
> - orte_errmgr.post_startup() start the persistent RML message. There
> does not seem to be a shutdown version of this (to deregister the RML
> message at orte_finalize time). Was this
13 matches