Re: [OMPI devel] difference between my_node_rank and my_local_rank in orte proc_info_t

2014-09-26 Thread Ralph Castain
The s1 method is correct; the other two are wrong.


> On Sep 26, 2014, at 9:24 AM, Pritchard Jr., Howard  wrote:
> 
> Hi Folks,
>  
> I’m trying to figure out something about the kind of info PMIs are supposed
> to be feeding back up into orte/ompi, partly because native launch
> doesn’t seem to work too well in trunk. 
>  
> One of the things I’m puzzling about is the purpose of the my_node_rank
> field in orte_proc_info_t.
>  
> I’m particularly puzzled, because with the new pmix s2 and cray components,
> we are returning the NODE RANK (i.e. which NODE in a virtual sense, the
> proc is on) via a pmix_X_parse_pmap function.   See find_my_node in both
> the pmix_cray_parse_pmap and pmix_s2_parse_pmap functions. This
> value is subsequently returned when the pmix component is queried with the
> PMIX_NODE_RANK attribute.
>  
> But, with s1, we seem to just be assigning the same value to my_node_rank
> as to my_local_rank, based on the contents of the array returned from 
> PMI_Get_clique_ranks.
>  
> It appears that in the 1.7/1.8 branch, the behavior is to take the s1 route,
> irrespective of whether pmi2 from slurm or the pmi.fuzzy from cray is used.
>  
> Thanks for any help,
>  
> Howard
>  
>  
> -
> Howard Pritchard
> HPC-5
> Los Alamos National Laboratory
>  
>  
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15923.php


[OMPI devel] difference between my_node_rank and my_local_rank in orte proc_info_t

2014-09-26 Thread Pritchard Jr., Howard
Hi Folks,

I'm trying to figure out something about the kind of info PMIs are supposed
to be feeding back up into orte/ompi, partly because native launch
doesn't seem to work too well in trunk.

One of the things I'm puzzling about is the purpose of the my_node_rank
field in orte_proc_info_t.

I'm particularly puzzled, because with the new pmix s2 and cray components,
we are returning the NODE RANK (i.e. which NODE in a virtual sense, the
proc is on) via a pmix_X_parse_pmap function.   See find_my_node in both
the pmix_cray_parse_pmap and pmix_s2_parse_pmap functions. This
value is subsequently returned when the pmix component is queried with the
PMIX_NODE_RANK attribute.

But, with s1, we seem to just be assigning the same value to my_node_rank
as to my_local_rank, based on the contents of the array returned from 
PMI_Get_clique_ranks.

It appears that in the 1.7/1.8 branch, the behavior is to take the s1 route,
irrespective of whether pmi2 from slurm or the pmi.fuzzy from cray is used.

Thanks for any help,

Howard


-
Howard Pritchard
HPC-5
Los Alamos National Laboratory




Re: [OMPI devel] race condition in oob/tcp

2014-09-26 Thread Ralph Castain
Thanks!


On Fri, Sep 26, 2014 at 12:56 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Ralph,
>
> I just committed r32799 to fix this issue.
> I CMR'ed it (#4923) and set the target to 1.8.4.
>
> Cheers,
>
> Gilles
>
>
> On 2014/09/23 22:55, Ralph Castain wrote:
>
> Thanks! I won't have time to work on it this week, but I appreciate your 
> effort. Also, thanks for clarifying the race condition with respect to 1.8; I agree it 
> is not a blocker for that release.
>
> Ralph
>
> On Sep 22, 2014, at 4:49 PM, Gilles Gouaillardet 
>   wrote:
>
>
>  Ralph,
>
> Here is the patch I am using so far.
> I will resume working on this on Wednesday (there is at least one remaining 
> race condition) unless you have the time to take care of it today.
>
> So far, the race condition has only been observed in real life with the 
> grpcomm/rcd module, and this is not the default in v1.8, so IMHO this is not 
> a blocker for v1.8.3.
>
> Cheers,
>
> Gilles
>
> On Tue, Sep 23, 2014 at 7:46 AM, Ralph Castain  
>  wrote:
> Gilles - please let me know if/when you think you'll do this. I'm debating 
> about adding it to 1.8.3, but don't want to delay that release too long. 
> Alternatively, I can take care of it if you don't have time (I'm asking if 
> you can do it solely because you have the reproducer).
>
>
> On Sep 21, 2014, at 6:54 AM, Ralph Castain  
>  wrote:
>
>
>  Sounds fine with me - please go ahead, and thanks
>
> On Sep 20, 2014, at 10:26 PM, Gilles Gouaillardet 
>   wrote:
>
>
>  Thanks for the pointer, George!
>
> On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca  
>  wrote:
> Or copy the handshake protocol design of the TCP BTL...
>
>
> The main difference between oob/tcp and btl/tcp is the way we resolve the 
> situation in which two processes send their first message to each other at 
> the same time.
>
> In oob/tcp, all (i.e. one or two) sockets are closed and the higher vpid is 
> directed to retry establishing a connection.
>
> In btl/tcp, only the useless socket is closed (i.e. the one that was 
> connect()-ed by the lower vpid and accept()-ed by the higher vpid).
>
>
> My first impression is that oob/tcp is unnecessarily complex and should use 
> the simpler and more efficient protocol of btl/tcp.
> That being said, this conclusion could be too naive, and for some good reasons 
> I am not aware of, the btl/tcp handshake protocol might not be a good fit for oob/tcp.
>
> Any thoughts?
>
> Starting tomorrow, I will revamp oob/tcp to use the same btl/tcp handshake 
> protocol, unless told otherwise.
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15885.php
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15895.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15897.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15900.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15920.php
>


Re: [OMPI devel] 1.8.3rc2 available

2014-09-26 Thread Kawashima, Takahiro
Just FYI:
configure && make && make install && make test
succeeded on my SPARC64/Linux/GCC machine (with both enable-debug=yes and no).

Takahiro Kawashima,
MPI development team,
Fujitsu

> Usual place:
> 
> http://www.open-mpi.org/software/ompi/v1.8/
> 
> Please beat it up, as we want to release on Fri, barring discovery of a blocker.
> Ralph


Re: [OMPI devel] race condition in oob/tcp

2014-09-26 Thread Gilles Gouaillardet
Ralph,

I just committed r32799 to fix this issue.
I CMR'ed it (#4923) and set the target to 1.8.4.

Cheers,

Gilles

On 2014/09/23 22:55, Ralph Castain wrote:
> Thanks! I won't have time to work on it this week, but I appreciate your 
> effort. Also, thanks for clarifying the race condition with respect to 1.8; I agree it 
> is not a blocker for that release.
>
> Ralph
>
> On Sep 22, 2014, at 4:49 PM, Gilles Gouaillardet 
>  wrote:
>
>> Ralph,
>>
>> Here is the patch I am using so far.
>> I will resume working on this on Wednesday (there is at least one 
>> remaining race condition) unless you have the time to take care of it 
>> today.
>>
>> So far, the race condition has only been observed in real life with the 
>> grpcomm/rcd module, and this is not the default in v1.8, so IMHO this is not 
>> a blocker for v1.8.3.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Tue, Sep 23, 2014 at 7:46 AM, Ralph Castain  wrote:
>> Gilles - please let me know if/when you think you'll do this. I'm debating 
>> about adding it to 1.8.3, but don't want to delay that release too long. 
>> Alternatively, I can take care of it if you don't have time (I'm asking if 
>> you can do it solely because you have the reproducer).
>>
>>
>> On Sep 21, 2014, at 6:54 AM, Ralph Castain  wrote:
>>
>>> Sounds fine with me - please go ahead, and thanks
>>>
>>> On Sep 20, 2014, at 10:26 PM, Gilles Gouaillardet 
>>>  wrote:
>>>
>>>> Thanks for the pointer, George!
>>>>
>>>> On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca wrote:
>>>> Or copy the handshake protocol design of the TCP BTL...
>>>>
>>>>
>>>> The main difference between oob/tcp and btl/tcp is the way we resolve the 
>>>> situation in which two processes send their first message to each other at 
>>>> the same time.
>>>>
>>>> In oob/tcp, all (i.e. one or two) sockets are closed and the higher vpid 
>>>> is directed to retry establishing a connection.
>>>>
>>>> In btl/tcp, only the useless socket is closed (i.e. the one that was 
>>>> connect()-ed by the lower vpid and accept()-ed by the higher vpid).
>>>>
>>>>
>>>> My first impression is that oob/tcp is unnecessarily complex and should 
>>>> use the simpler and more efficient protocol of btl/tcp.
>>>> That being said, this conclusion could be too naive, and for some good 
>>>> reasons I am not aware of, the btl/tcp handshake protocol might not be a 
>>>> good fit for oob/tcp.
>>>>
>>>> Any thoughts?
>>>>
>>>> Starting tomorrow, I will revamp oob/tcp to use the same btl/tcp 
>>>> handshake protocol, unless told otherwise.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15885.php
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15895.php
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15897.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15900.php