There was a long thread of discussion on why we must use an rte_barrier and
not an mpi_barrier during finalize. Basically, as long as we have
connectionless, unreliable BTLs we need an external mechanism to ensure
complete tear-down of the entire infrastructure. Thus, we rely on an
rte_barrier not because it guarantees the correctness of the code, but
because it gives all processes enough time to flush all HPC traffic.
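
To make the ordering concrete, here is a minimal C-style sketch of the
teardown sequence discussed below (MPI barrier, del_procs, rte barrier,
pml finalize). The function names are illustrative placeholders, not the
actual Open MPI symbols:

    /* Placeholder declarations -- stand-ins for the real OMPI internals,
     * named here only for illustration. */
    void mpi_barrier_world(void);   /* MPI_Barrier on MPI_COMM_WORLD       */
    void del_procs(void);           /* disconnect from all peer processes  */
    void rte_barrier(void);         /* runtime-level (out-of-band) barrier */
    void pml_finalize(void);        /* shut down the PML                   */

    /* Sketch of the finalize ordering proposed in this thread. */
    static void finalize_teardown_sketch(void)
    {
        /* 1. MPI-level barrier: no rank starts tearing down connections
         *    while another rank may still be sending it MPI traffic. */
        mpi_barrier_world();

        /* 2. Disconnect from all peers. */
        del_procs();

        /* 3. RTE-level barrier: it runs outside the MPI transports, so it
         *    also gives connectionless/unreliable BTLs time to drain any
         *    in-flight traffic before the layers below are torn down. */
        rte_barrier();

        /* 4. Only now shut the PML down. */
        pml_finalize();
    }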

  George.



On Mon, Jul 21, 2014 at 1:10 PM, Yossi Etigin <yos...@mellanox.com> wrote:

> I see. But in branch v1.8, in r31869, Ralph reverted the commit which moved
> del_procs after the barrier:
>   "Revert r31851 until we can resolve how to close these leaks without
> causing the usnic BTL to fail during disconnect of intercommunicators
>    Refs #4643"
> Also, we need an rte barrier after del_procs - because otherwise rank A
> could call pml_finalize() before rank B finishes disconnecting from rank A.
>
> I think the order in finalize should be like this:
>         1. mpi_barrier(world)
>         2. del_procs()
>         3. rte_barrier()
>         4. pml_finalize()
>
> -----Original Message-----
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Monday, July 21, 2014 8:01 PM
> To: Open MPI Developers
> Cc: Yossi Etigin
> Subject: Re: [OMPI devel] barrier before calling del_procs
>
> I should add that it is an rte barrier and not an MPI barrier for
> technical reasons.
>
> -Nathan
>
> On Mon, Jul 21, 2014 at 09:42:53AM -0700, Ralph Castain wrote:
> >    We already have an rte barrier before del procs
> >
> >    Sent from my iPhone
> >    On Jul 21, 2014, at 8:21 AM, Yossi Etigin <yos...@mellanox.com> wrote:
> >
> >      Hi,
> >
> >
> >
> >      We get occasional hangs with MTL/MXM during finalize, because a
> >      global synchronization is needed before calling del_procs.
> >
> >      e.g. rank A may call del_procs() and disconnect from rank B, while
> >      rank B is still working.
> >
> >      What do you think about adding an MPI barrier on COMM_WORLD before
> >      calling del_procs()?
> >
> >
>
>
>
