[OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-15 Thread Gilles Gouaillardet

Ralph,

i noticed MPI_Comm_spawn is broken on master and on RHEL7

for some reason i cannot yet explain, it works just fine on RHEL6 (!)


mpirun -np 1 ./dynamic/intercomm_create

from the ibm test suite can be used to reproduce the issue.



i dug a bit and found that OPAL_ERR_DEBUGGER_RELEASE is fired in mpirun,
and the tasks then receive a PMIX_ERR_DEBUGGER_RELEASE notification.
it seems no event handler is registered, so the default handler kills
the task.
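
To make the failure mode concrete, here is a minimal sketch of a notification dispatcher that falls back to a fatal default handler when nothing is registered for a status code. It is a toy model only, not the PMIx or OPAL code: every `toy_` name and the value of `TOY_ERR_DEBUGGER_RELEASE` are hypothetical and stand in for the real symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status code standing in for PMIX_ERR_DEBUGGER_RELEASE. */
#define TOY_ERR_DEBUGGER_RELEASE (-42)
#define TOY_MAX_HANDLERS 8

typedef enum { TOY_HANDLED, TOY_ABORT } toy_action_t;
typedef toy_action_t (*toy_handler_fn)(int status);

typedef struct {
    int status;          /* status code this entry is registered for */
    toy_handler_fn fn;   /* callback invoked when that code is notified */
} toy_registration_t;

static toy_registration_t toy_table[TOY_MAX_HANDLERS];
static size_t toy_count = 0;

/* Default handler: a notification nobody claimed is treated as fatal. */
toy_action_t toy_default_handler(int status)
{
    (void)status;
    return TOY_ABORT;
}

/* A "dummy" handler that only acknowledges the notification. */
toy_action_t toy_ignore_handler(int status)
{
    (void)status;
    return TOY_HANDLED;
}

/* Register fn for one status code; returns 0 on success, -1 if the table is full. */
int toy_register(int status, toy_handler_fn fn)
{
    assert(fn != NULL);
    if (toy_count == TOY_MAX_HANDLERS) {
        return -1;
    }
    toy_table[toy_count].status = status;
    toy_table[toy_count].fn = fn;
    toy_count++;
    return 0;
}

/* Deliver a notification: first matching handler wins, else the default one. */
toy_action_t toy_notify(int status)
{
    for (size_t i = 0; i < toy_count; i++) {
        if (toy_table[i].status == status) {
            return toy_table[i].fn(status);
        }
    }
    return toy_default_handler(status);
}
```

In this model, calling `toy_notify(TOY_ERR_DEBUGGER_RELEASE)` before any registration returns `TOY_ABORT`; after `toy_register(TOY_ERR_DEBUGGER_RELEASE, toy_ignore_handler)` the same call returns `TOY_HANDLED`.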


for the time being, a trivial workaround is not to fire
OPAL_ERR_DEBUGGER_RELEASE in the first place (see patch below)


could you please have a look ?

i am not sure whether clients should not be notified at all, or whether
they should register a dummy handler.

fwiw, in _event_hdlr, cd->nondefault is true on RHEL6 but false on
RHEL7, which might indicate a race condition



Cheers,


Gilles

diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
index b9d571c..0de0e79 100644
--- a/orte/orted/orted_submit.c
+++ b/orte/orted/orted_submit.c
@@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;

 static void _send_notification(void)
 {
+#if 0
 opal_buffer_t buf;
 int status = OPAL_ERR_DEBUGGER_RELEASE;
 orte_grpcomm_signature_t sig;
@@ -2209,6 +2210,7 @@ static void _send_notification(void)
 }
 OBJ_DESTRUCT(&sig);
 OBJ_DESTRUCT(&buf);
+#endif
 }

 static void orte_debugger_dump(void)





Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-15 Thread Ralph Castain
That would break debugger attach. Sounds to me like it’s just an uninitialized 
variable in _event_hdlr?




Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-15 Thread Gilles Gouaillardet
Yep,

The constructor of pmix2x_threadshift_t (tscon) does not initialize
nondefault to false.
I won't be able to investigate this until Monday, but so far, my guess is
that if the constructor is fixed, then RHEL6 will fail like RHEL7 ...
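
As a rough illustration of the constructor problem, the sketch below contrasts a constructor that leaves a flag unset with one that sets it explicitly. It is not the pmix2x source: `toy_threadshift_t`, `toy_tscon_buggy`, `toy_tscon_fixed` and `toy_new` are hypothetical names, and the `memset` with `0xA5` is only an assumption used to simulate stale heap contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the thread-shift object; only the flag matters here. */
typedef struct {
    int status;
    bool nondefault;
} toy_threadshift_t;

/* Constructor that forgets the flag: nondefault keeps whatever was in memory. */
void toy_tscon_buggy(toy_threadshift_t *p)
{
    assert(p != NULL);
    p->status = 0;
}

/* Constructor that sets every field explicitly. */
void toy_tscon_fixed(toy_threadshift_t *p)
{
    assert(p != NULL);
    p->status = 0;
    p->nondefault = false;
}

/* Allocate on deliberately dirty memory, then run the given constructor.
 * Returns NULL if the allocation fails; the caller frees the object. */
toy_threadshift_t *toy_new(void (*ctor)(toy_threadshift_t *))
{
    toy_threadshift_t *p = (toy_threadshift_t *)malloc(sizeof(*p));
    if (NULL == p) {
        return NULL;
    }
    memset(p, 0xA5, sizeof(*p));  /* simulate stale heap contents */
    ctor(p);
    return p;
}
```

With `toy_new(toy_tscon_fixed)` the flag is always false. With `toy_new(toy_tscon_buggy)` the flag is indeterminate and must not be relied on, which would be consistent with the same code behaving differently on two platforms.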

fwiw, the intercomm_create test used to fail in Cisco MTT because of too
many tasks and no oversubscription; now it fails because of this bug.

Cheers,

Gilles



Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-15 Thread Ralph Castain
Okay, I’ll take a look - thanks!




[OMPI devel] v2.0.1 PRs: open season

2016-07-15 Thread Jeff Squyres (jsquyres)
v2.0.1 is officially open to accept PRs.

Please note that many v2.0.1 PRs still need reviews:

- 36 open v2.0.1 PRs
- only 13 have reviews

Please start getting reviews for your v2.0.1 PRs -- no review, no merge:


https://github.com/open-mpi/ompi-release/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Aopen%20milestone%3Av2.0.1

Also, some of the PRs are a little old -- I just kicked off CI on PRs that 
hadn't had a CI run in the past week (although the Mellanox Jenkins looks like 
it might be failing tests due to a local issue -- hopefully we can get that 
fixed up shortly).

-- 
Jeff Squyres
jsquy...@cisco.com