I encountered a deadlock in sync_wait_mt().
After investigations, it appears that a first thread executing
wait_sync_update() decrements sync->count just after a second thread in
sync_wait_mt() made the test:
    if (sync->count <= 0)
        return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERR
Yeah, that looks like a bug to me. We need to keep the check before the lock
but otherwise this is fine and should be fixed in 2.0.2.
-Nathan
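The race described above is a lost wakeup: the waiter tests the counter, the completer decrements it and signals, and only then does the waiter block. Below is a minimal sketch of the usual remedy, assuming plain pthreads and C11 atomics. It is not the Open MPI code, and every sketch_* name is hypothetical. The unlocked check stays as a fast path, and the counter is tested again under the lock before waiting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the sync object discussed in the thread.
 * The count/status names follow the thread; the layout is an assumption. */
typedef struct {
    atomic_int      count;   /* completions still outstanding */
    int             status;  /* 0 means success */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} sketch_sync_t;

static void sketch_sync_init(sketch_sync_t *s, int count)
{
    assert(s != NULL);
    atomic_init(&s->count, count);
    s->status = 0;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
}

/* Completer side: decrement, and signal under the lock when the count
 * reaches zero. Taking the lock orders the signal after any waiter that
 * has already tested the count under the same lock. */
static void sketch_sync_update(sketch_sync_t *s)
{
    if (atomic_fetch_sub(&s->count, 1) == 1) {
        pthread_mutex_lock(&s->lock);
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
    }
}

/* Waiter side: unlocked fast-path check, then a second check under the lock. */
static int sketch_sync_wait(sketch_sync_t *s)
{
    if (atomic_load(&s->count) <= 0) {
        return s->status;                 /* fast path, no lock taken */
    }
    pthread_mutex_lock(&s->lock);
    while (atomic_load(&s->count) > 0) {  /* re-test before blocking */
        pthread_cond_wait(&s->cond, &s->lock);
    }
    pthread_mutex_unlock(&s->lock);
    return s->status;
}

/* Thread entry point that completes one outstanding request. */
static void *sketch_completer(void *arg)
{
    sketch_sync_update((sketch_sync_t *)arg);
    return NULL;
}
```

Because the waiter holds the lock from its second test until pthread_cond_wait() releases it, the completer cannot deliver its signal in between, so the wakeup is not lost in this sketch.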
> On Sep 21, 2016, at 3:16 AM, DEVEZE, PASCAL wrote:
>
> I encountered a deadlock in sync_wait_mt().
>
> After investigations, it appears that a first
Nice catch. Keeping the first check only works because the signaling field
prevents us from releasing the condition too early. I added some comments
around the code (131fe42d).
George.
On Wed, Sep 21, 2016 at 5:33 AM, Nathan Hjelm wrote:
> Yeah, that looks like a bug to me. We need to keep the
The current code in the TCP BTL prevents local execution on a laptop not
exposing a public IP address, by unconditionally disqualifying all
interfaces with local addresses. This is not done based on MCA parameters
but instead is done deep inside the IP matching logic, independent of what
the user s
What will happen when you run this in a TCP-based networked environment?
I.e., won't the TCP BTL then publish the 127.x.x.x address in the modex, and
then other peers will think "oh, that's on the same subnet as me, so therefore
I should be able to communicate with that endpoint over my 127.x.x.
On Wed, Sep 21, 2016 at 10:41 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:
> What will happen when you run this in a TCP-based networked environment?
>
> I.e., won't the TCP BTL then publish the 127.x.x.x address in the modex,
> and then other peers will think "oh, that's on the same s
On Sep 21, 2016, at 10:56 AM, George Bosilca wrote:
>
> No, because 127.x.x.x is by default part of the exclude, so it will never get
> into the modex. The problem today is that even if you manually remove it
> from the exclude and add it to the include, it will not work, because of the
> har
George,
Is proc locality already set at that time?
If yes, then we could keep a hard-coded test so that 127.x.y.z addresses (and
their IPv6 equivalents) are never used (even if included or not excluded) for
inter-node communication.
Cheers,
Gilles
"Jeff Squyres (jsquyres)" wrote:
>On Sep 21, 2016, at 10:
FWIW: you know the location of every proc (to at least the host level) from
the time of orte_init, which should precede anything in the BTL.
> On Sep 21, 2016, at 8:31 AM, Gilles Gouaillardet
> wrote:
>
> George,
>
> Is proc locality already set at that time ?
>
> If yes, then we could keep a har
My proposal is not about adding new ways of deciding what is local and what
is not. I proposed to use the corresponding MCA parameters to allow the user
to decide. More specifically, I want to be able to change the exclude and
include MCA parameters to enable TCP over local addresses.
George
On Sep 21, 2016 4:
George,
i got that, and i consider my suggestion as an improvement to your proposal.
if i want to exclude ib0, i might want to
mpirun --mca btl_tcp_if_exclude ib0 ...
to me, this is an honest mistake, but with your proposal, i would be
screwed when running on more than one node because i should
On Wed, Sep 21, 2016 at 9:36 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:
>
> if i want to exclude ib0, i might want to
> mpirun --mca btl_tcp_if_exclude ib0 ...
>
> to me, this is an honest mistake, but with your proposal, i would be
> screwed when
> running on more than one no
Gilles,
I don't understand how your proposal is any different than what we have
today. I quote "If [locality flag is set], then we could keep a hard coded
test so 127.x.y.z address (and IPv6 equivalent) are never used (even if
included or not excluded) for inter node communication". We already hav
On Wed, Sep 21, 2016 at 11:23 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:
> > I would have agreed with you if the current code was making a better
> decision of what is local and what not. But it is not, it simply removes all
> 127.x.x.x interfaces (opal/util/net.c:222). Thus, the only
George,
let's consider the case where "lo" is *not* excluded via the
btl_tcp_if_exclude MCA param
(if i understand correctly, the following is also true if "lo" is
included via the btl_tcp_if_include MCA param)
currently, and because of/thanks to the test that is done "deep inside"
1) on a discon
On Wednesday, September 21, 2016, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:
> George,
>
> let's consider the case where "lo" is *not* excluded via the
> btl_tcp_if_exclude MCA param
> (if i understand correctly, the following is also true if "lo" is
> included via the btl_tcp_if_
ok, i was not clear
by "let's consider the case where "lo" is *not* excluded via the
btl_tcp_if_exclude MCA param" i really meant
"let's consider the case where the value of the btl_tcp_if_exclude MCA
param has been forced to a list of network/interfaces that do not
contain any reference (e.g. nam