Tee Wen Kai -

You asked "Just to find out more about the consequences for exiting MPI
processes without calling MPI_Finalize, will it cause memory leak or other
fatal problem?"

Be aware that Jeff has offered you an answer oriented to the Open MPI
implementation rather than to the MPI standard.

When a communicator involves two or more tasks and any task in that
communicator goes down, the remaining member tasks enter a state that the
MPI standard says cannot be trusted.  It is legitimate for the process that
manages an MPI job as a single entity to recognize that the loss of one
member task has made the state of all connected tasks untrustworthy, and to
bring down all previously connected tasks as well.

When you use MPI_Comm_spawn, one result is an intercommunicator connecting
the task that did the spawn to the task(s) that were spawned, so the two
sides are "connected".  If you intend to use MPI to communicate between the
spawn caller and the spawned tasks, they must remain connected.  You can
explicitly disconnect them with MPI_Comm_disconnect, and after that a
failure of the spawned task is harmless to the task that spawned it, but
doing the disconnect costs you the communication path.
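
In case a concrete shape helps, here is a minimal parent-side sketch of my
own (not from Open MPI or the standard text; the worker executable name
"./worker" is just a placeholder) of spawning a task and then disconnecting
from it:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm child;          /* intercommunicator to the spawned task */
    int token = 42;

    MPI_Init(&argc, &argv);

    /* Spawn one worker; "child" connects this task to it. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    /* While connected, normal MPI traffic flows over the
       intercommunicator (remote rank 0 is the worker). */
    MPI_Send(&token, 1, MPI_INT, 0, 0, child);

    /* MPI_Comm_disconnect is collective: the worker must post the
       matching receive and call MPI_Comm_disconnect on the communicator
       it gets from MPI_Comm_get_parent.  After this point a crash of
       the worker is harmless here, but the communication path to it
       is gone. */
    MPI_Comm_disconnect(&child);

    MPI_Finalize();
    return 0;
}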

The MPI standard does not require that connected tasks be brought down, but
doing so is valid MPI implementation behavior.  This makes some sense when
you consider that there is no MPI mechanism by which the other tasks can
see that the communicator involving the lost task is now broken, and no way
a collective communication can work "correctly" on a communicator that has
lost a member task.

For example, what would it mean to call MPI_Reduce on MPI_COMM_WORLD when a
member of MPI_COMM_WORLD has been lost (especially if it is the root that
was lost)?  If you had an MPI application that computed for hours between
the loss of one task and the next collective call on MPI_COMM_WORLD, would
you prefer to pay for hours of computation and then deadlock at the
collective call, or to abort ASAP once the job is recognizably broken?
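
To make that concrete, a small sketch of my own (not from any real
application) of the kind of call that is left with no good option:

#include <mpi.h>

int main(int argc, char *argv[])
{
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);

    /* ... hours of computation that produce "local" ... */

    /* If any member of MPI_COMM_WORLD has already died -- especially
       rank 0, the root -- the surviving tasks have no MPI-level way to
       find out before calling, so an implementation that does not abort
       the job could leave this call blocked forever. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}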

There is a fault tolerance working group trying to define something for MPI
3.0, but at this stage they are still working out a proposal to bring
before the MPI Forum.  You might be interested in getting involved in that
effort.  They are trying to address questions like:
- How would a task know it should not make collective calls on the broken
communicator?
- Should the communicator still support point-to-point communication with
the remaining tasks?
- If a task has posted a receive and the expected sender is then lost, how
should the posted receive behave?
- Is there a clean way to "repair" the broken communicator by spawning a
replacement task?
- Is there a clean way to "shrink" the broken communicator?

The Fault Tolerance Working Group has taken on a very tough problem.  The
list above is just a tiny sample of the challenges in making MPI fault
tolerant.

             Dick


Dick Treumann  -  MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363



                                                                       
From:     Jeff Squyres <jsquy...@cisco.com>
To:       "Open MPI Users" <us...@open-mpi.org>
Date:     06/04/2009 07:32 AM
Subject:  Re: [OMPI users] Exit Program Without Calling MPI_Finalize For Special Case
Sent by:  users-boun...@open-mpi.org

On Jun 4, 2009, at 2:16 AM, Tee Wen Kai wrote:

> Just to find out more about the consequences for exiting MPI
> processes without calling MPI_Finalize, will it cause memory leak or
> other fatal problem?

If you're exiting the process, you won't cause any kind of problems --
the OS will clean up everything.

However, we might also have the orted clean up some things when MPI
processes unexpectedly die (e.g., filesystem temporary files in /tmp).
So you might want to leave those around to clean themselves up and die
naturally.

--
Jeff Squyres
Cisco Systems

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
