On Mon, Jul 21, 2014 at 1:41 PM, Yossi Etigin <yos...@mellanox.com> wrote:
> Right, but:
>
> 1. IMHO the rte_barrier is in the wrong place (in the trunk)

In the trunk we have the rte_barrier prior to del_procs, which is what I would have expected: quiesce the BTLs by reaching a point where everybody agrees that no more MPI messages will be exchanged, and then delete the BTLs.

> 2. In addition to the rte_barrier, we also need an mpi_barrier

Care to provide the reasoning for this barrier? Why and where should it be placed?

George.

> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
> Sent: Monday, July 21, 2014 8:19 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] barrier before calling del_procs
>
> There was a long thread of discussion on why we must use an rte_barrier and not an mpi_barrier during finalize. Basically, as long as we have connectionless, unreliable BTLs we need an external mechanism to ensure complete tear-down of the entire infrastructure. Thus, we need to rely on an rte_barrier not because it guarantees the correctness of the code, but because it gives all processes enough time to flush all HPC traffic.
>
> George.
>
> On Mon, Jul 21, 2014 at 1:10 PM, Yossi Etigin <yos...@mellanox.com> wrote:
>
> I see. But in branch v1.8, in r31869, Ralph reverted the commit which moved del_procs after the barrier:
> "Revert r31851 until we can resolve how to close these leaks without causing the usnic BTL to fail during disconnect of intercommunicators. Refs #4643"
> Also, we need an rte barrier after del_procs, because otherwise rank A could call pml_finalize() before rank B finishes disconnecting from rank A.
>
> I think the order in finalize should be like this:
> 1. mpi_barrier(world)
> 2. del_procs()
> 3. rte_barrier()
> 4. pml_finalize()
>
> -----Original Message-----
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Monday, July 21, 2014 8:01 PM
> To: Open MPI Developers
> Cc: Yossi Etigin
> Subject: Re: [OMPI devel] barrier before calling del_procs
>
> I should add that it is an rte barrier and not an MPI barrier, for technical reasons.
>
> -Nathan
>
> On Mon, Jul 21, 2014 at 09:42:53AM -0700, Ralph Castain wrote:
> > We already have an rte barrier before del_procs.
> >
> > Sent from my iPhone
> >
> > On Jul 21, 2014, at 8:21 AM, Yossi Etigin <yos...@mellanox.com> wrote:
> >
> > Hi,
> >
> > We get occasional hangs with MTL/MXM during finalize, because a global synchronization is needed before calling del_procs.
> > E.g., rank A may call del_procs() and disconnect from rank B, while rank B is still working.
> > What do you think about adding an MPI barrier on COMM_WORLD before calling del_procs()?
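For readers following the thread, here is a minimal, self-contained C sketch of the finalize ordering Yossi proposes above. The function names are placeholder stubs, not the actual Open MPI symbols (the real sequence lives in ompi_mpi_finalize() and the PML/RTE frameworks); only the ordering and the comments reflect the discussion.

#include <stdio.h>

/* Placeholder stubs -- not the real Open MPI symbols -- standing in for
 * the four steps discussed in the thread. */
static int mpi_barrier_world(void)
{
    puts("1. MPI barrier on COMM_WORLD: all ranks agree no more MPI traffic will be sent");
    return 0;
}

static int del_procs(void)
{
    puts("2. del_procs(): disconnect from all peers");
    return 0;
}

static int rte_barrier(void)
{
    puts("3. RTE barrier: wait until every peer has finished disconnecting");
    return 0;
}

static int pml_finalize(void)
{
    puts("4. pml_finalize(): tear down the PML");
    return 0;
}

int main(void)
{
    /* The proposed ordering: quiesce MPI traffic, delete procs, then an
     * out-of-band (RTE) barrier so that no rank finalizes its PML while a
     * peer is still disconnecting from it, and only then finalize the PML. */
    if (mpi_barrier_world() || del_procs() || rte_barrier() || pml_finalize())
        return 1;
    return 0;
}

The barrier in step 3 is an RTE (out-of-band) barrier rather than an MPI one, presumably because after del_procs() the MPI-level paths can no longer be relied on, so an external mechanism is needed to confirm that every peer has finished disconnecting before the PML goes away.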