Ah, I see what you are getting at now.

The construction of the list of connected processes is something I 
intentionally did not modify from the current Open MPI code. The list is 
calculated from the locally known set of local and remote process groups 
attached to the communicator. So this is the set of directly connected 
processes in the specified communicator, as known to the calling process at 
the OMPI level.
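
As a rough illustration - using only the standard MPI group API rather than 
the internal OMPI structures the code actually touches - the directly 
connected set of a communicator could be enumerated like this (a sketch, 
not the patch code):

  /* Sketch: enumerate the processes directly connected through a
   * communicator. An intracommunicator contributes its (local) group;
   * an intercommunicator also contributes its remote group. */
  #include <mpi.h>
  #include <stdio.h>

  static void list_connected(MPI_Comm comm)
  {
      MPI_Group local_grp, remote_grp;
      int local_size = 0, remote_size = 0, is_inter = 0;

      MPI_Comm_test_inter(comm, &is_inter);
      MPI_Comm_group(comm, &local_grp);
      MPI_Group_size(local_grp, &local_size);

      if (is_inter) {
          MPI_Comm_remote_group(comm, &remote_grp);
          MPI_Group_size(remote_grp, &remote_size);
          MPI_Group_free(&remote_grp);
      }

      printf("directly connected: %d local + %d remote\n",
             local_size, remote_size);
      MPI_Group_free(&local_grp);
  }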

ORTE is asked to abort this defined set of processes. Once those processes 
are terminated, ORTE needs to eventually inform all of the processes (in the 
jobid(s) specified - maybe other jobids too?) that these processes have 
failed/aborted. Upon notification of the failed/aborted processes, the local 
process (at the OMPI level) needs to determine whether that process loss is 
critical, based upon the error handlers attached to the communicators it 
shares with the failed/aborted processes. That should be handled in the 
callback from the errmgr at the OMPI level, since connectedness is an MPI 
construct. If the process failure/abort is critical to the local process, 
then upon notification the local process can call abort on the affected 
communicator.
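
In rough pseudocode - every symbol below is hypothetical, since the real 
callback interface is what the pending errmgr RFC will provide - that 
decision could look like:

  /* Hypothetical sketch of the OMPI-level reaction to a fault
   * notification; none of these names are real Open MPI symbols. */
  void example_fault_callback(example_proc_t *failed_proc)
  {
      /* Walk every communicator known to this process. */
      for (example_comm_t *comm = example_first_comm();
           NULL != comm; comm = example_next_comm(comm)) {

          /* Only communicators shared with the failed process matter. */
          if (!example_comm_contains(comm, failed_proc)) {
              continue;
          }

          /* The loss is critical when the fatal error handler is
           * attached to a shared communicator. */
          if (example_comm_errhandler_is_fatal(comm)) {
              example_abort_on_comm(comm); /* may trigger further aborts */
          }
      }
  }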

So this has the possibility for a rolling abort effect [the abort of one 
communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From 
there (depending upon the error handlers at the user level) the system will 
eventually converge to either some stable subset of processes, or to all 
processes aborting, resulting in job termination.

The rolling abort effect relies heavily upon the ability of the runtime to 
make sure that all process failures/aborts are eventually known to all alive 
processes. Since every alive process will know of the failure/abort, each can 
then determine whether it is transitively affected by the failure, based upon 
its local list of communicators and the associated error handlers. Completing 
this aspect of the abort procedure does require the callback mechanism from 
the runtime. But since ORTE (today) will kill the whole job for OMPI, that is 
not a big deal for end users - the job will terminate anyway. Once we have 
the callback, we can finish tightening up the OMPI layer code.

It is not perfect, but I think it does address the transitive nature of the 
connectivity of MPI processes by relying on the runtime to provide uniform 
notification of failures. I figure we will need to look over this code again 
and verify that the implementation of MPI_Comm_disconnect and its associated 
underpinnings do the 'right thing' with regard to updating the communicator 
structures. But I think that is best addressed in a second set of patches.


The goal of this patch is to restore functionality that was commented out 
during the last reorganization of the errmgr. What will likely follow, once 
we have notification of failures/aborts at the OMPI level, is a cleanup of 
the connected-groups code paths.


-- Josh


On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:

> What I'm saying is that there is no reason to have any other type of 
> MPI_Abort if we are not able to compute the set of connected processes. 
> 
> With this RFC the processes in the communicator passed to MPI_Abort will 
> abort. Then the other processes in the same MPI_COMM_WORLD (in fact, jobid) 
> will be notified (if we suppose that ORTE will not make a distinction 
> between aborted and faulty). As a result the entire MPI_COMM_WORLD will be 
> aborted, if we consider a sane application where everyone uses the same 
> type of error handler. However, this is not enough. We have to distribute 
> the abort signal to every other "connected" process, and I don't see how we 
> can compute this list of connected processes in Open MPI today. It is not 
> that I don't see it in your patch, it is that the definition of 
> connectivity in the MPI standard is transitive and relies heavily on a 
> correct implementation of MPI_Comm_disconnect.
> 
>  george.
> 
> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
> 
>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>> If this changes the behavior of MPI_Abort to only abort processes on the 
>>> specified communicator, how does this not affect the default user 
>>> experience (when today it aborts everything)?
>> 
>> Open MPI does abort everything by default - that is decided by the runtime
>> at the moment (but addressed in your RFC). So it does not matter whether
>> one process aborts or many do; the behavior of MPI_Abort experienced by
>> the user will not change. Effectively the only change is an extra message
>> in the runtime before the process actually calls errmgr.abort().
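>>
>> For example (a user-level sketch - with today's Open MPI the whole job
>> terminates either way, which is exactly why the observed behavior does
>> not change):
>>
>>   /* Sketch: request an abort on a subcommunicator only. */
>>   #include <mpi.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>       int rank;
>>       MPI_Comm half;
>>
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       /* Split the world; abort is requested on one half only. */
>>       MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &half);
>>       if (0 == rank) {
>>           MPI_Abort(half, 1); /* best attempt on the group of 'half' */
>>       }
>>       MPI_Comm_free(&half);
>>       MPI_Finalize();
>>       return 0;
>>   }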
>> 
>> This branch just makes the implementation complete by first telling
>> ORTE that a group of processes, defined by the communicator, should be
>> terminated along with the calling process. Currently ORTE notices that
>> there was an abort, and terminates the job. Once your RFC goes through,
>> this may no longer be the case, and OMPI can determine what to do
>> when it receives a process failure notification.
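>>
>> In outline, the ordering the branch implements is roughly the following
>> (hypothetical names, just to make the sequence concrete):
>>
>>   /* Sketch of the patched abort ordering; all symbols hypothetical. */
>>   int example_mpi_abort(example_comm_t *comm, int errcode)
>>   {
>>       /* 1. Build the set of directly connected processes from the
>>        *    communicator's local and remote groups. */
>>       example_proc_list_t *victims = example_connected_procs(comm);
>>
>>       /* 2. Ask the runtime (via the HNP) to terminate that set. */
>>       example_errmgr_request_abort(victims, errcode);
>>
>>       /* 3. Then abort the calling process itself. */
>>       example_errmgr_abort_self(errcode);
>>       return errcode; /* not reached */
>>   }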
>> 
>>> 
>>> If we accept the fact that MPI_Abort will only abort the processes in the 
>>> current communicator, what happens to the other processes in the same 
>>> MPI_COMM_WORLD (but not in the communicator that has been used by 
>>> MPI_Abort)?
>> 
>> Currently, ORTE will abort them as well. When your RFC goes through,
>> the OMPI layer will be notified of the error and can take the
>> appropriate action, as determined by the MPI standard.
>> 
>>> What about all the other connected processes (based on the connectivity 
>>> as defined in the MPI standard in Section 10.5.4)? Do they see this as a 
>>> fault?
>> 
>> They are informed of the fault via the ORTE errmgr callback routine
>> (for which we have an RFC), and can then take the appropriate action
>> based on MPI semantics. So we are pushing the decision about the
>> implications of the fault up to the OMPI layer - where it should be.
>> 
>> 
>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and other
>> connected error management scenarios is not included in this patch,
>> since it depends on there being a callback to the OMPI layer - which
>> does not exist just yet. So this is a small patch to wire in the ORTE
>> piece, allowing the OMPI layer to request that a set of processes be
>> terminated - to more accurately support MPI_Abort semantics.
>> 
>> Does that answer your questions?
>> 
>> -- Josh
>> 
>> 
>>> 
>>> george.
>>> 
>>> On Jun 9, 2011, at 16:32 , Josh Hursey wrote:
>>> 
>>>> WHAT: Fix missing code in MPI_Abort
>>>> 
>>>> WHY: MPI_Abort is missing logic to ask for termination of the process
>>>> group defined by the communicator
>>>> 
>>>> WHERE: Mostly orte/mca/errmgr
>>>> 
>>>> WHEN: Open MPI trunk
>>>> 
>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>> 
>>>> Details:
>>>> -------------------------------------------
>>>> A bitbucket branch is available here (last sync to r24757 of trunk)
>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>> 
>>>> In the MPI Standard (v2.2) Section 8.7, after the introduction of
>>>> MPI_Abort, it states:
>>>> "This routine makes a best attempt to abort all tasks in the group of 
>>>> comm."
>>>> 
>>>> Open MPI currently only calls orte_errmgr.abort() to abort the calling
>>>> process itself. The code to ask for the abort of the other processes
>>>> in the group defined by the communicator is commented out. Since one
>>>> process calling abort currently causes all processes in the job to
>>>> abort, this has not been a big deal. However, as the group starts
>>>> exploring better resilience in the OMPI layer (with further support
>>>> from the ORTE layer), it will become more important to get this
>>>> aspect of MPI_Abort right.
>>>> 
>>>> This branch adds back the logic necessary for a single process calling
>>>> MPI_Abort to request, from ORTE errmgr, that a defined subgroup of
>>>> processes be aborted. Once the request is sent to the HNP, the local
>>>> process then calls abort on itself. The HNP requests that the defined
>>>> subgroup of processes be terminated using the existing plm mechanisms
>>>> for doing so.
>>>> 
>>>> This change has no effect on the default MPI_Abort behavior that users
>>>> currently experience.
>>>> 
>>>> --
>>>> Joshua Hursey
>>>> Postdoctoral Research Associate
>>>> Oak Ridge National Laboratory
>>>> http://users.nccs.gov/~jjhursey
>> 
>> 
>> 
>> -- 
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> 