On Mon, Jul 21, 2014 at 1:41 PM, Yossi Etigin <yos...@mellanox.com> wrote:

>  Right, but:
>
> 1.       IMHO the rte_barrier in the wrong place (in the trunk)
>

In the trunk we have the rte_barrier prior to del_procs, which is what I
would have expected: quiesce the BTLs by reaching a point where everybody
agrees that no more MPI messages will be exchanged, and then delete the
BTLs.
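
For reference, a minimal stand-alone sketch of that ordering (the helper
names below are just illustrative stand-ins, not the actual Open MPI
symbols):

    /* Illustrative only: local stand-ins, not the Open MPI internals. */
    #include <stdio.h>

    static void rte_barrier(void)  { puts("rte barrier: out-of-band sync point"); }
    static void del_procs(void)    { puts("del_procs: tear down peer endpoints"); }
    static void pml_finalize(void) { puts("pml_finalize: shut the PML down"); }

    int main(void)
    {
        /* Trunk ordering: quiesce first, then delete. */
        rte_barrier();   /* everybody agrees no more MPI traffic will be sent */
        del_procs();     /* only now is it safe to tear down the BTL endpoints */
        pml_finalize();
        return 0;
    }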


>  2.       In addition to the rte_barrier, need also mpi_barrier
>
Care to provide the reasoning for this barrier? Why is it needed, and where
should it be placed?

  George.




>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *George
> Bosilca
> *Sent:* Monday, July 21, 2014 8:19 PM
> *To:* Open MPI Developers
>
> *Subject:* Re: [OMPI devel] barrier before calling del_procs
>
>
>
> There was a long thread of discussion on why we must use an rte_barrier
> and not an mpi_barrier during finalize. Basically, as long as we have
> connectionless, unreliable BTLs we need an external mechanism to ensure
> complete tear-down of the entire infrastructure. Thus, we rely on an
> rte_barrier not because it guarantees the correctness of the code, but
> because it gives all processes enough time to flush all HPC traffic.
>
>
>
>   George.
>
>
>
>
>
> On Mon, Jul 21, 2014 at 1:10 PM, Yossi Etigin <yos...@mellanox.com> wrote:
>
> I see. But in the v1.8 branch, in 31869, Ralph reverted the commit that
> moved del_procs after the barrier:
>   "Revert r31851 until we can resolve how to close these leaks without
> causing the usnic BTL to fail during disconnect of intercommunicators
>    Refs #4643"
> Also, we need an rte barrier after del_procs, because otherwise rank A
> could call pml_finalize() before rank B has finished disconnecting from
> rank A.
>
> I think the order in finalize should be like this (rough sketch below):
>         1. mpi_barrier(world)
>         2. del_procs()
>         3. rte_barrier()
>         4. pml_finalize()
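>
> A stand-alone sketch of that ordering (the helpers are local stand-ins
> for the real del_procs / rte barrier / PML code, not the actual Open MPI
> symbols; MPI_Barrier is the only genuine MPI call here):
>
>     /* Illustrative only: local stand-ins, not the real Open MPI code. */
>     #include <mpi.h>
>
>     static void del_procs(void)    { /* disconnect from every peer */ }
>     static void rte_barrier(void)  { /* out-of-band barrier over the RTE */ }
>     static void pml_finalize(void) { /* release PML/BTL resources */ }
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>
>         /* In Open MPI itself this ordering would live inside MPI_Finalize;
>          * here it is spelled out as plain calls. */
>         MPI_Barrier(MPI_COMM_WORLD); /* 1. nobody starts teardown while peers may still send */
>         del_procs();                 /* 2. disconnect from all peers */
>         rte_barrier();               /* 3. wait until everyone has finished disconnecting */
>         pml_finalize();              /* 4. only then release PML resources */
>
>         MPI_Finalize();
>         return 0;
>     }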
>
>
> -----Original Message-----
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Monday, July 21, 2014 8:01 PM
> To: Open MPI Developers
> Cc: Yossi Etigin
> Subject: Re: [OMPI devel] barrier before calling del_procs
>
> I should add that it is an rte barrier and not an MPI barrier for
> technical reasons.
>
> -Nathan
>
> On Mon, Jul 21, 2014 at 09:42:53AM -0700, Ralph Castain wrote:
> >    We already have an rte barrier before del_procs
> >
> >    Sent from my iPhone
> >    On Jul 21, 2014, at 8:21 AM, Yossi Etigin <yos...@mellanox.com>
> >    wrote:
> >
> >      Hi,
> >
> >
> >
> >      We get occasional hangs with MTL/MXM during finalize, because a
> >      global synchronization is needed before calling del_procs.
> >
> >      E.g., rank A may call del_procs() and disconnect from rank B,
> >      while rank B is still working.
> >
> >      What do you think about adding an MPI barrier on COMM_WORLD
> >      before calling del_procs()?
> >
> >
>
>
>
>
>
>
