Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Joshua Hursey
On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> Well, you're way too trusting. ;)

It's the midwestern boy in me :)

> This only works if all components play the game, and even then it is
> difficult if you want to allow components to deregister themselves in the
> middle of the execut

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-09 Thread Joshua Hursey
Ah I see what you are getting at now. The construction of the list of connected processes is something I, intentionally, did not modify from the current Open MPI code. The list is calculated based on the locally known set of local and remote process groups attached to the communicator. So this

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread George Bosilca
Well, you're way too trusting. ;) This only works if all components play the game, and even then it is difficult if you want to allow components to deregister themselves in the middle of the execution. The problem is that a callback will be previous for some component, and that when you want

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-09 Thread George Bosilca
What I'm saying is that there is no reason to have any other type of MPI_Abort if we are not able to compute the set of connected processes. With this RFC the processes on the communicator on MPI_Abort will abort. Then the other processes in the same MPI_COMM_WORLD (in fact jobid) will be notif

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-09 Thread Josh Hursey
On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca wrote:
> If this changes the behavior of MPI_Abort to only abort processes on the
> specified communicator, how does this not affect the default user experience
> (when today it aborts everything)?

Open MPI does abort everything by default - decided

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-09 Thread George Bosilca
If this changes the behavior of MPI_Abort to only abort processes on the specified communicator, how does this not affect the default user experience (when today it aborts everything)? If we accept the fact that MPI_Abort will only abort the processes in the current communicator what happens with

[OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-09 Thread Josh Hursey
WHAT: Fix missing code in MPI_Abort
WHY: MPI_Abort is missing logic to ask for termination of the process group defined by the communicator
WHERE: Mostly orte/mca/errmgr
WHEN: Open MPI trunk
TIMEOUT: Tuesday, June 14, 2011 (after teleconf)

Details:
---
A

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:

    orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);

Which is a callback that just calls abort (which is what we want to do by default):

    void ompi_errhandler_runtime_callbac

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Ralph Castain
You mean you want the abort API to point somewhere else, without using a new component? Perhaps a telecon would help resolve this quicker? I'm available tomorrow or anytime next week, if that helps.

On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey wrote:
> As long as there is the ability to remove

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
As long as there is the ability to remove and replace a callback I'm fine. I personally think that forcing the errmgr to track ordering of callback registration makes it a more complex solution, but as long as it works. In particular I need to replace the default 'abort' errmgr call in OMPI with s
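The remove-and-replace behaviour asked for above can be sketched as a single callback slot. This is only an illustration, not the ORTE errmgr code: every name below (`fault_action_fn`, `set_fault_action`, `handle_fault`, the `FAULT_*` codes) is hypothetical, and the default handler returns an action code rather than really calling abort so the sketch can be exercised safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical action codes; per the thread, the real default callback
 * simply aborts. Here the decision is returned instead of acted on. */
enum fault_action { FAULT_ABORT = 0, FAULT_CONTINUE = 1 };

/* Hypothetical callback type: receives the rank of the failed process. */
typedef enum fault_action (*fault_action_fn)(int failed_rank);

/* Stand-in for the default 'abort' behaviour. */
enum fault_action default_abort_action(int failed_rank)
{
    (void)failed_rank;
    return FAULT_ABORT;
}

/* Stand-in for a replacement that lets the job keep running. */
enum fault_action continue_action(int failed_rank)
{
    (void)failed_rank;
    return FAULT_CONTINUE;
}

/* The single slot; it always holds a valid callback. */
static fault_action_fn active_action = default_abort_action;

/* Replace the active callback and return the one that was installed.
 * Passing NULL removes the current callback and restores the default. */
fault_action_fn set_fault_action(fault_action_fn cb)
{
    fault_action_fn previous = active_action;
    active_action = (cb != NULL) ? cb : default_abort_action;
    return previous;
}

/* Called when a fault is reported; dispatches to the active callback. */
enum fault_action handle_fault(int failed_rank)
{
    assert(active_action != NULL);
    return active_action(failed_rank);
}
```

With a single slot there is no ordering to track, which matches the concern about complexity raised here, but it also means only one layer can own the fault decision at a time.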

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Ralph Castain
I agree - let's not get overly complex unless we can clearly articulate a requirement to do so.

On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca wrote:
> This will require exactly opposite registration and de-registration order,
> or no de-registration at all (aka no way to unload a component). Or

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread George Bosilca
This will require exactly opposite registration and de-registration order, or no de-registration at all (aka no way to unload a component). Or some even more complex code to deal with internally. If the error manager handles the callbacks it can use the registration ordering (which will be what
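One way to picture the alternative described here, where the error manager itself owns the list of callbacks, is the sketch below. It is an assumption-laden illustration rather than anything from the patch: `fault_registry_t`, `registry_add`, `registry_remove`, `registry_fire`, and the `component_*_cb` functions are all hypothetical names, and firing in registration order is one possible policy, not necessarily the one intended in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FAULT_CALLBACKS 8

/* Hypothetical callback type: each callback appends to a trace string. */
typedef void (*fault_cb_fn)(char *trace);

/* The registry owns the ordering, so no component has to remember
 * which callback was installed before its own. */
typedef struct {
    fault_cb_fn slots[MAX_FAULT_CALLBACKS];
    size_t count;
} fault_registry_t;

/* Append a callback; returns 0 on success, -1 if full or cb is NULL. */
int registry_add(fault_registry_t *reg, fault_cb_fn cb)
{
    assert(reg != NULL);
    if (cb == NULL || reg->count == MAX_FAULT_CALLBACKS) {
        return -1;
    }
    reg->slots[reg->count++] = cb;
    return 0;
}

/* Remove a callback from any position, keeping the others in order.
 * Returns 0 on success, -1 if the callback was not registered. */
int registry_remove(fault_registry_t *reg, fault_cb_fn cb)
{
    assert(reg != NULL);
    for (size_t i = 0; i < reg->count; i++) {
        if (reg->slots[i] == cb) {
            memmove(&reg->slots[i], &reg->slots[i + 1],
                    (reg->count - i - 1) * sizeof(reg->slots[0]));
            reg->count--;
            return 0;
        }
    }
    return -1;
}

/* Invoke every callback in registration order; returns how many ran. */
size_t registry_fire(const fault_registry_t *reg, char *trace)
{
    assert(reg != NULL);
    for (size_t i = 0; i < reg->count; i++) {
        reg->slots[i](trace);
    }
    return reg->count;
}

/* Example callbacks for three hypothetical components. The trace
 * buffer passed in must have room for the appended characters. */
void component_a_cb(char *trace) { strcat(trace, "A"); }
void component_b_cb(char *trace) { strcat(trace, "B"); }
void component_c_cb(char *trace) { strcat(trace, "C"); }
```

Because removal works from any position, a component registered in the middle can deregister without the others needing to unwind in reverse order.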

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-09 Thread Josh Hursey
On Wed, Jun 8, 2011 at 5:37 PM, Wesley Bland wrote:
> On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:
>
> - orte_errmgr.post_startup() start the persistent RML message. There
> does not seem to be a shutdown version of this (to deregister the RML
> message at orte_finalize time). Was this