[OMPI devel] Deadlock in sync_wait_mt(): Proposed patch

2016-09-21 Thread DEVEZE, PASCAL
I encountered a deadlock in sync_wait_mt(). After investigations, it appears that a first thread executing wait_sync_update() decrements sync->count just after a second thread in sync_wait_mt() made the test: if (sync->count <= 0) return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERR

Re: [OMPI devel] Deadlock in sync_wait_mt(): Proposed patch

2016-09-21 Thread Nathan Hjelm
Yeah, that looks like a bug to me. We need to keep the check before the lock but otherwise this is fine and should be fixed in 2.0.2. -Nathan > On Sep 21, 2016, at 3:16 AM, DEVEZE, PASCAL wrote: > > I encountered a deadlock in sync_wait_mt(). > > After investigations, it appears that a first

Re: [OMPI devel] Deadlock in sync_wait_mt(): Proposed patch

2016-09-21 Thread George Bosilca
Nice catch. Keeping the first check only works because the signaling field prevents us from releasing the condition too early. I added some comments around the code (131fe42d). George. On Wed, Sep 21, 2016 at 5:33 AM, Nathan Hjelm wrote: > Yeah, that looks like a bug to me. We need to keep the
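
The race described in this thread is a classic lost wakeup: the waiter tests the counter, the completing thread decrements and signals, and only then does the waiter go to sleep. A minimal sketch of the usual remedy follows, assuming plain pthreads and C11 atomics. It is not the Open MPI patch; `demo_sync_t` and the `demo_sync_*` functions are hypothetical names, and the return codes 0 and -1 merely stand in for the success and error codes mentioned above. The unlocked test is kept as a fast path, and the counter is tested again once the lock is held.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a wait-sync object; NOT the Open MPI structure. */
typedef struct {
    atomic_int count;       /* completions still outstanding */
    int status;             /* 0 on success; written only under the lock */
    pthread_mutex_t lock;
    pthread_cond_t cond;
} demo_sync_t;

void demo_sync_init(demo_sync_t *s, int count)
{
    atomic_init(&s->count, count);
    s->status = 0;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
}

/* Completing side: decrement and signal while holding the lock, so the
 * decrement cannot slip between the waiter's locked test and its sleep. */
void demo_sync_update(demo_sync_t *s, int updates, int status)
{
    pthread_mutex_lock(&s->lock);
    if (0 != status) {
        s->status = status;
    }
    /* atomic_fetch_sub returns the previous value */
    if (atomic_fetch_sub(&s->count, updates) - updates <= 0) {
        pthread_cond_signal(&s->cond);
    }
    pthread_mutex_unlock(&s->lock);
}

/* Waiting side: the unlocked test is only a fast path. The counter is
 * tested again under the lock before every sleep. */
int demo_sync_wait(demo_sync_t *s)
{
    if (atomic_load(&s->count) > 0) {
        pthread_mutex_lock(&s->lock);
        while (atomic_load(&s->count) > 0) {
            pthread_cond_wait(&s->cond, &s->lock);
        }
        pthread_mutex_unlock(&s->lock);
    }
    return (0 == s->status) ? 0 : -1;
}
```

In this sketch the completing thread takes the lock around the decrement; the actual code discussed in the thread relies on a signaling field instead, which is not modeled here.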

[OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
The current code in the TCP BTL prevents local execution on a laptop not exposing a public IP address, by unconditionally disqualifying all interfaces with local addresses. This is not done based on MCA parameters but instead is done deep inside the IP matching logic, independent of what the user s

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread Jeff Squyres (jsquyres)
What will happen when you run this in a TCP-based networked environment? I.e., won't the TCP BTL then publish the 127.x.x.x address in the modex, and then other peers will think "oh, that's on the same subnet as me, so therefore I should be able to communicate with that endpoint over my 127.x.x.

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
On Wed, Sep 21, 2016 at 10:41 AM, Jeff Squyres (jsquyres) < jsquy...@cisco.com> wrote: > What will happen when you run this in a TCP-based networked environment? > > I.e., won't the TCP BTL then publish the 127.x.x.x address in the modex, > and then other peers will think "oh, that's on the same s

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread Jeff Squyres (jsquyres)
On Sep 21, 2016, at 10:56 AM, George Bosilca wrote: > > No, because 127.x.x.x is by default part of the exclude, so it will never get > into the modex. The problem today, is that even if you manually remove it > from the exclude and add it to the include, it will not work, because of the > har

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread Gilles Gouaillardet
George, Is proc locality already set at that time ? If yes, then we could keep a hard coded test so 127.x.y.z address (and IPv6 equivalent) are never used (even if included or not excluded) for inter node communication Cheers, Gilles "Jeff Squyres (jsquyres)" wrote: >On Sep 21, 2016, at 10:

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread r...@open-mpi.org
FWIW: you know the location of every proc (to at least the host level) from time of orte_init, which should precede anything in the BTL > On Sep 21, 2016, at 8:31 AM, Gilles Gouaillardet > wrote: > > George, > > Is proc locality already set at that time ? > > If yes, then we could keep a har

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
My proposal is not about adding new ways of deciding what is local and what not. I proposed to use the corresponding MCA parameters to allow the user to decide. More specifically, I want to be able to change the exclude and include MCA to enable TCP over local addresses. George On Sep 21, 2016 4:
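
The policy George describes here can be illustrated with a small sketch, assuming IPv4 only: a 127.x.x.x address is rejected by default but accepted when the user has explicitly opted in. This is not the code in opal/util/net.c. The function names are hypothetical, and the `user_enabled_loopback` flag stands in for whatever the include and exclude MCA parameters would resolve to.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical helper: true if `text` is a valid IPv4 address in 127.0.0.0/8. */
bool demo_is_loopback_ipv4(const char *text)
{
    struct in_addr addr;
    if (1 != inet_pton(AF_INET, text, &addr)) {
        return false;                       /* not a valid IPv4 address */
    }
    /* s_addr is in network byte order; the top octet selects the /8 block */
    return 127 == (ntohl(addr.s_addr) >> 24);
}

/* Hypothetical policy: loopback addresses are usable only when the user
 * asked for them; every other valid address is accepted as before. */
bool demo_accept_ipv4(const char *text, bool user_enabled_loopback)
{
    struct in_addr addr;
    if (1 != inet_pton(AF_INET, text, &addr)) {
        return false;
    }
    if (demo_is_loopback_ipv4(text)) {
        return user_enabled_loopback;
    }
    return true;
}
```

With this shape, `demo_accept_ipv4("127.0.0.1", false)` rejects the address, matching the default exclusion, while passing `true` lets it through without changing how other addresses are treated.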

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread Gilles Gouaillardet
George, i got that, and i consider my suggestion as an improvement to your proposal. if i want to exclude ib0, i might want to mpirun --mca btl_tcp_if_exclude ib0 ... to me, this is an honest mistake, but with your proposal, i would be screwed when running on more than one node because i should

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread Paul Hargrove
On Wed, Sep 21, 2016 at 9:36 AM, Gilles Gouaillardet < gilles.gouaillar...@gmail.com> wrote: > > if i want to exclude ib0, i might want to > mpirun --mca btl_tcp_if_exclude ib0 ... > > to me, this is an honest mistake, but with your proposal, i would be > screwed when > running on more than one no

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
Gilles, I don't understand how your proposal is any different than what we have today. I quote "If [locality flag is set], then we could keep a hard coded test so 127.x.y.z address (and IPv6 equivalent) are never used (even if included or not excluded) for inter node communication". We already hav

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
On Wed, Sep 21, 2016 at 11:23 AM, Jeff Squyres (jsquyres) < jsquy...@cisco.com> wrote: > > I would have agreed with you if the current code was doing a better > decision of what is local and what not. But it is not, it simply remove all > 127.x.x.x interfaces (opal/util/net.c:222). Thus, the only

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread Gilles Gouaillardet
George, let's consider the case where "lo" is *not* excluded via the btl_tcp_if_exclude MCA param (if i understand correctly, the following is also true if "lo" is included via the btl_tcp_if_include MCA param) currently, and because of/thanks to the test that is done "deep inside" 1) on a discon

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
On Wednesday, September 21, 2016, Gilles Gouaillardet < gilles.gouaillar...@gmail.com> wrote: > George, > > let's consider the case where "lo" is *not* excluded via the > btl_tcp_if_exclude MCA param > (if i understand correctly, the following is also true if "lo" is > included via the btl_tcp_if_

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread Gilles Gouaillardet
ok, i was not clear by "let's consider the case where "lo" is *not* excluded via the btl_tcp_if_exclude MCA param" i really meant "let's consider the case where the value of the btl_tcp_if_exclude MCA param has been forced to a list of network/interfaces that do not contain any reference (e.g. nam