[OMPI devel] mpirun hang (regression in bffb2b7a4bb49c9188d942201b8a8f04872ff63c)

2014-12-24 Thread Gilles Gouaillardet
Ralph,

i tried to debug the issue reported by Siegmar at
http://www.open-mpi.org/community/lists/users/2014/12/26052.php

i have not been able to try this on an heterogeneous cluster yet, but i
could
reproduce a hang with 2 nodes and 3 tasks :

mpirun -host node0,node1 -np 3 --mca btl tcp,self --mca coll ^ml
./helloworld

git bisect pointed to commit
https://github.com/hppritcha/ompi/commit/bffb2b7a4bb49c9188d942201b8a8f04872ff63c,
and reverting a subpart of this commit "fixes" the hang
(see attached patch)

your change correctly prevents the use of uninitialized data (worst case
scenario is a crash),
but has some undesired side effects that eventually causes a hang.


could you please have a look at it ?

Cheers,

Gilles
diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
index 4f0493c..0f4c816 100644
--- a/orte/orted/pmix/pmix_server.c
+++ b/orte/orted/pmix/pmix_server.c
@@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, orte_process_name_t* sender,
 /* pass across any returned blobs */
 opal_dss.copy_payload(reply, buffer);
 stored = true;
-OBJ_RETAIN(reply);
-PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
 }
+OBJ_RETAIN(reply);
+PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
 } else {
 /* If peer has an access to shared memory dstore, check
  * if we already stored data for the target process.


Re: [OMPI devel] mpirun hang (regression in bffb2b7a4bb49c9188d942201b8a8f04872ff63c)

2014-12-24 Thread Ralph Castain
Looks like someone already took care of it this morning - thanks!

> On Dec 23, 2014, at 11:26 PM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> i tried to debug the issue reported by Siegmar at
> http://www.open-mpi.org/community/lists/users/2014/12/26052.php
> 
> i have not been able to try this on an heterogeneous cluster yet, but i
> could
> reproduce a hang with 2 nodes and 3 tasks :
> 
> mpirun -host node0,node1 -np 3 --mca btl tcp,self --mca coll ^ml
> ./helloworld
> 
> git bisect pointed to commit
> https://github.com/hppritcha/ompi/commit/bffb2b7a4bb49c9188d942201b8a8f04872ff63c,
> and reverting a subpart of this commit "fixes" the hang
> (see attached patch)
> 
> your change correctly prevents the use of uninitialized data (worst case
> scenario is a crash),
> but has some undesired side effects that eventually causes a hang.
> 
> 
> could you please have a look at it ?
> 
> Cheers,
> 
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16720.php



Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-612-g05af80b

2014-12-24 Thread Ralph Castain
Hi Nadezhda

I’m afraid this commit is still incorrect as it means that reply can be used 
without ever being initialized. Somehow, you have to OBJ_NEW reply before you 
can use it.

Could you please correct this?

Thanks
Ralph

> On Dec 24, 2014, at 3:30 AM, git...@crest.iu.edu wrote:
> 
> This is an automated email from the git hooks/post-receive script. It was
> generated because a ref change was pushed to the repository containing
> the project "open-mpi/ompi".
> 
> The branch, master has been updated
>   via  05af80b3025dbb95bdd4280087450791291d7219 (commit)
>  from  b9349d2eb9117c61205c98d5d2d5175d26971d23 (commit)
> 
> Those revisions listed above that are new to this repository have
> not appeared on any other notification email; so we list those
> revisions in full, below.
> 
> - Log -
> https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219
> 
> commit 05af80b3025dbb95bdd4280087450791291d7219
> Author: Nadezhda Kogteva 
> Date:   Wed Dec 24 13:25:23 2014 +0200
> 
>Fix commit bffb2b7a4bb49c9188d942201b8a8f04872ff63c which broke pmix 
> server functionality
> 
> diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
> index 4f0493c..0f4c816 100644
> --- a/orte/orted/pmix/pmix_server.c
> +++ b/orte/orted/pmix/pmix_server.c
> @@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, 
> orte_process_name_t* sender,
> /* pass across any returned blobs */
> opal_dss.copy_payload(reply, buffer);
> stored = true;
> -OBJ_RETAIN(reply);
> -PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
> }
> +OBJ_RETAIN(reply);
> +PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
> } else {
> /* If peer has an access to shared memory dstore, check
>  * if we already stored data for the target process.
> 
> 
> ---
> 
> Summary of changes:
> orte/orted/pmix/pmix_server.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> 
> hooks/post-receive
> -- 
> open-mpi/ompi
> ___
> ompi-commits mailing list
> ompi-comm...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits



Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-612-g05af80b

2014-12-24 Thread Elena Elkina
Hi Ralph,

As I remember the idea of this code was to create a reply once (and set
flag stored to true) but send this reply multiple times (to each process
from the list of requests). Flag stored is set to false earlier in the
code. It means that once (for the first request in the loop
pmix_server_pending_dmx_reqs) it will create the object.

Best regards,
Elena

On Wed, Dec 24, 2014 at 7:05 PM, Ralph Castain  wrote:

> Hi Nadezhda
>
> I’m afraid this commit is still incorrect as it means that reply can be
> used without ever being initialized. Somehow, you have to OBJ_NEW reply
> before you can use it.
>
> Could you please correct this?
>
> Thanks
> Ralph
>
> > On Dec 24, 2014, at 3:30 AM, git...@crest.iu.edu wrote:
> >
> > This is an automated email from the git hooks/post-receive script. It was
> > generated because a ref change was pushed to the repository containing
> > the project "open-mpi/ompi".
> >
> > The branch, master has been updated
> >   via  05af80b3025dbb95bdd4280087450791291d7219 (commit)
> >  from  b9349d2eb9117c61205c98d5d2d5175d26971d23 (commit)
> >
> > Those revisions listed above that are new to this repository have
> > not appeared on any other notification email; so we list those
> > revisions in full, below.
> >
> > - Log -
> >
> https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219
> >
> > commit 05af80b3025dbb95bdd4280087450791291d7219
> > Author: Nadezhda Kogteva 
> > Date:   Wed Dec 24 13:25:23 2014 +0200
> >
> >Fix commit bffb2b7a4bb49c9188d942201b8a8f04872ff63c which broke pmix
> server functionality
> >
> > diff --git a/orte/orted/pmix/pmix_server.c
> b/orte/orted/pmix/pmix_server.c
> > index 4f0493c..0f4c816 100644
> > --- a/orte/orted/pmix/pmix_server.c
> > +++ b/orte/orted/pmix/pmix_server.c
> > @@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status,
> orte_process_name_t* sender,
> > /* pass across any returned blobs */
> > opal_dss.copy_payload(reply, buffer);
> > stored = true;
> > -OBJ_RETAIN(reply);
> > -PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
> > }
> > +OBJ_RETAIN(reply);
> > +PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
> > } else {
> > /* If peer has an access to shared memory dstore, check
> >  * if we already stored data for the target process.
> >
> >
> > ---
> >
> > Summary of changes:
> > orte/orted/pmix/pmix_server.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> >
> > hooks/post-receive
> > --
> > open-mpi/ompi
> > ___
> > ompi-commits mailing list
> > ompi-comm...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16722.php


Re: [OMPI devel] OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-612-g05af80b

2014-12-24 Thread Gilles Gouaillardet
Ralph,

I had second thougts on what i wrote earlier, and i think the code is correct. 
e.g. reply cannot be used uninitialized.

That being said, i think reply should be initialized to null and OBJ_RELEASE'd 
if not null on exit in order to avoid a memory leak.

Sorry for the confusion,

Gilles 

Ralph Castain さんのメール:
>Hi Nadezhda
>
>I’m afraid this commit is still incorrect as it means that reply can be used 
>without ever being initialized. Somehow, you have to OBJ_NEW reply before you 
>can use it.
>
>Could you please correct this?
>
>Thanks
>Ralph
>
>> On Dec 24, 2014, at 3:30 AM, git...@crest.iu.edu wrote:
>> 
>> This is an automated email from the git hooks/post-receive script. It was
>> generated because a ref change was pushed to the repository containing
>> the project "open-mpi/ompi".
>> 
>> The branch, master has been updated
>>   via  05af80b3025dbb95bdd4280087450791291d7219 (commit)
>>  from  b9349d2eb9117c61205c98d5d2d5175d26971d23 (commit)
>> 
>> Those revisions listed above that are new to this repository have
>> not appeared on any other notification email; so we list those
>> revisions in full, below.
>> 
>> - Log -
>> https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219
>> 
>> commit 05af80b3025dbb95bdd4280087450791291d7219
>> Author: Nadezhda Kogteva 
>> Date:   Wed Dec 24 13:25:23 2014 +0200
>> 
>>Fix commit bffb2b7a4bb49c9188d942201b8a8f04872ff63c which broke pmix 
>> server functionality
>> 
>> diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
>> index 4f0493c..0f4c816 100644
>> --- a/orte/orted/pmix/pmix_server.c
>> +++ b/orte/orted/pmix/pmix_server.c
>> @@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, 
>> orte_process_name_t* sender,
>> /* pass across any returned blobs */
>> opal_dss.copy_payload(reply, buffer);
>> stored = true;
>> -OBJ_RETAIN(reply);
>> -PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
>> }
>> +OBJ_RETAIN(reply);
>> +PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
>> } else {
>> /* If peer has an access to shared memory dstore, check
>>  * if we already stored data for the target process.
>> 
>> 
>> ---
>> 
>> Summary of changes:
>> orte/orted/pmix/pmix_server.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> 
>> hooks/post-receive
>> -- 
>> open-mpi/ompi
>> ___
>> ompi-commits mailing list
>> ompi-comm...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16722.php

Re: [OMPI devel] OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-612-g05af80b

2014-12-24 Thread Ralph Castain
The code is awfully hard to follow - I’m refactoring it on a branch now to try 
and make it more understandable and will deal with the memory leak there.


> On Dec 24, 2014, at 7:35 AM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> I had second thougts on what i wrote earlier, and i think the code is 
> correct. e.g. reply cannot be used uninitialized.
> 
> That being said, i think reply should be initialized to null and 
> OBJ_RELEASE'd if not null on exit in order to avoid a memory leak.
> 
> Sorry for the confusion,
> 
> Gilles 
> 
> Ralph Castain さんのメール:
>> Hi Nadezhda
>> 
>> I’m afraid this commit is still incorrect as it means that reply can be used 
>> without ever being initialized. Somehow, you have to OBJ_NEW reply before 
>> you can use it.
>> 
>> Could you please correct this?
>> 
>> Thanks
>> Ralph
>> 
>>> On Dec 24, 2014, at 3:30 AM, git...@crest.iu.edu wrote:
>>> 
>>> This is an automated email from the git hooks/post-receive script. It was
>>> generated because a ref change was pushed to the repository containing
>>> the project "open-mpi/ompi".
>>> 
>>> The branch, master has been updated
>>>  via  05af80b3025dbb95bdd4280087450791291d7219 (commit)
>>> from  b9349d2eb9117c61205c98d5d2d5175d26971d23 (commit)
>>> 
>>> Those revisions listed above that are new to this repository have
>>> not appeared on any other notification email; so we list those
>>> revisions in full, below.
>>> 
>>> - Log -
>>> https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219
>>> 
>>> commit 05af80b3025dbb95bdd4280087450791291d7219
>>> Author: Nadezhda Kogteva 
>>> Date:   Wed Dec 24 13:25:23 2014 +0200
>>> 
>>>   Fix commit bffb2b7a4bb49c9188d942201b8a8f04872ff63c which broke pmix 
>>> server functionality
>>> 
>>> diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
>>> index 4f0493c..0f4c816 100644
>>> --- a/orte/orted/pmix/pmix_server.c
>>> +++ b/orte/orted/pmix/pmix_server.c
>>> @@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, 
>>> orte_process_name_t* sender,
>>>/* pass across any returned blobs */
>>>opal_dss.copy_payload(reply, buffer);
>>>stored = true;
>>> -OBJ_RETAIN(reply);
>>> -PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
>>>}
>>> +OBJ_RETAIN(reply);
>>> +PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
>>>} else {
>>>/* If peer has an access to shared memory dstore, check
>>> * if we already stored data for the target process.
>>> 
>>> 
>>> ---
>>> 
>>> Summary of changes:
>>> orte/orted/pmix/pmix_server.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>> 
>>> 
>>> hooks/post-receive
>>> -- 
>>> open-mpi/ompi
>>> ___
>>> ompi-commits mailing list
>>> ompi-comm...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16722.php
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16724.php



Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-612-g05af80b

2014-12-24 Thread Ralph Castain
Thanks - as per the other note on the thread, I’ve been working to refactor 
things anyway and will try to make the code a little more readable so this is 
clearer.

> On Dec 24, 2014, at 7:28 AM, Elena Elkina  wrote:
> 
> Hi Ralph,
> 
> As I remember the idea of this code was to create a reply once (and set flag 
> stored to true) but send this reply multiple times (to each process from the 
> list of requests). Flag stored is set to false earlier in the code. It means 
> that once (for the first request in the loop pmix_server_pending_dmx_reqs) it 
> will create the object.
> 
> Best regards,
> Elena
> 
> On Wed, Dec 24, 2014 at 7:05 PM, Ralph Castain  > wrote:
> Hi Nadezhda
> 
> I’m afraid this commit is still incorrect as it means that reply can be used 
> without ever being initialized. Somehow, you have to OBJ_NEW reply before you 
> can use it.
> 
> Could you please correct this?
> 
> Thanks
> Ralph
> 
> > On Dec 24, 2014, at 3:30 AM, git...@crest.iu.edu 
> >  wrote:
> >
> > This is an automated email from the git hooks/post-receive script. It was
> > generated because a ref change was pushed to the repository containing
> > the project "open-mpi/ompi".
> >
> > The branch, master has been updated
> >   via  05af80b3025dbb95bdd4280087450791291d7219 (commit)
> >  from  b9349d2eb9117c61205c98d5d2d5175d26971d23 (commit)
> >
> > Those revisions listed above that are new to this repository have
> > not appeared on any other notification email; so we list those
> > revisions in full, below.
> >
> > - Log -
> > https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219
> >  
> > 
> >
> > commit 05af80b3025dbb95bdd4280087450791291d7219
> > Author: Nadezhda Kogteva  > >
> > Date:   Wed Dec 24 13:25:23 2014 +0200
> >
> >Fix commit bffb2b7a4bb49c9188d942201b8a8f04872ff63c which broke pmix 
> > server functionality
> >
> > diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
> > index 4f0493c..0f4c816 100644
> > --- a/orte/orted/pmix/pmix_server.c
> > +++ b/orte/orted/pmix/pmix_server.c
> > @@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, 
> > orte_process_name_t* sender,
> > /* pass across any returned blobs */
> > opal_dss.copy_payload(reply, buffer);
> > stored = true;
> > -OBJ_RETAIN(reply);
> > -PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
> > }
> > +OBJ_RETAIN(reply);
> > +PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
> > } else {
> > /* If peer has an access to shared memory dstore, check
> >  * if we already stored data for the target process.
> >
> >
> > ---
> >
> > Summary of changes:
> > orte/orted/pmix/pmix_server.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> >
> > hooks/post-receive
> > --
> > open-mpi/ompi
> > ___
> > ompi-commits mailing list
> > ompi-comm...@open-mpi.org 
> > http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits 
> > 
> 
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16722.php 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16723.php