Re: [OMPI devel] r31916 question

2014-06-19 Thread Ralph Castain
No, neither ORTE nor OMPI makes any such assumption. That's up to the scheduler.
We will launch a separate orted for each job, though, to avoid cross-contamination.

On Jun 19, 2014, at 8:00 AM, Pritchard, Howard P <howa...@lanl.gov> wrote:

> Hi Ralph,
>  
> Thanks for the explanation.  Does ORTE/OMPI always assume that for multi-node
> jobs there will only be one user’s job per node?  At my previous employer we
> had to make some changes to runtime components in order to support slurm, for
> which the customers’ default setting was to prefer filling nodes with jobs,
> even if that meant multi-node jobs of different users were intermingled
> within nodes.  The customers did not want to have to use the exclusive option.
>  
> Just a heads up, in case folks who are working on cray xe/xc systems are
> assuming that the way things work now with aprun will hold true going forward.
>  
> Howard
>  
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, June 18, 2014 5:00 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] r31916 question
>  
> You know, looking at the code and the comments, the rationale for putting the 
> nids in order was to prep the list for the regex generator. If you look in 
> the plm_ras_module, you'll see that we pass the nodelist to 
> orte_plm_base_orted_append_basic_args. ORNL used static ports for alps to get 
> better scaling, and so that function creates a regular expression from the 
> nodelist. We then pass that to each orted upon launch so it can compute the 
> URI for all other orteds in the system, thus allowing it to connect back to 
> mpirun thru the routing tree (instead of making a direct connection).
>  
> HTH
> Ralph
>  
> On Jun 18, 2014, at 3:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> 
> Ah, I see - yes, you'd use orte_get_attribute to retrieve it. Alternatively,
> you have it sitting right there in an array, so you could just use the array
> to order the list.
>  
>  
> On Jun 18, 2014, at 3:47 PM, Pritchard, Howard P <howa...@lanl.gov> wrote:
> 
> 
> Hi Ralph,
>  
> It is setting the attribute, but for some reason there seems to be a need to
> have the node ids (nids) in ascending order, so there’s some code looking at
> the old launch_id field, which no longer exists.
>  
> I’m fixing it.  I’d like to learn the process for getting fixes into trunk.
>  
> Thanks,
>  
> Howard
>  
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, June 18, 2014 4:45 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] r31916 question
>  
> Huh - thought I got that. Sorry I missed it. Let me take a look and ensure 
> that the alps ras module is setting that attribute
>  
> On Jun 18, 2014, at 2:40 PM, Pritchard, Howard P <howa...@lanl.gov> wrote:
> 
> 
> 
> Hello Folks,
>  
> I’m looking at commit 31916 and notice that a lot of fields were removed from
> orte_node_t.  This is now preventing ras_alps_module.c from compiling, owing
> to its use of a “launch_id” field.
>  
> In lieu of the direct use of launch_id, should I replace the code around line
> 587 of this file with a call to orte_get_attribute, with ORTE_NODE_LAUNCH_ID
> as the attribute to be retrieved?
>  
> Thanks,
>  
> Howard
>  
>  
> -
> Howard Pritchard
> HPC-5
> Los Alamos National Laboratory
>  
>  
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/15008.php
>  



Re: [OMPI devel] r31916 question

2014-06-19 Thread Adrian Reber
The fault tolerance code also needs additional changes because of this
commit. I have the changes prepared but not committed.

On Wed, Jun 18, 2014 at 03:45:11PM -0700, Ralph Castain wrote:
> Huh - thought I got that. Sorry I missed it. Let me take a look and ensure 
> that the alps ras module is setting that attribute
> 
> On Jun 18, 2014, at 2:40 PM, Pritchard, Howard P  wrote:
> 
> > Hello Folks,
> >  
> > I’m looking at commit 31916 and notice that a lot of fields were removed from
> > orte_node_t.  This is now preventing ras_alps_module.c from compiling, owing
> > to its use of a “launch_id” field.
> >  
> > In lieu of the direct use of launch_id, should I replace the code around line
> > 587 of this file with a call to orte_get_attribute, with ORTE_NODE_LAUNCH_ID
> > as the attribute to be retrieved?
> >  
> > Thanks,
> >  
> > Howard
> >  
> >  
> > -
> > Howard Pritchard
> > HPC-5
> > Los Alamos National Laboratory
> >  
> >  



Re: [OMPI devel] r31916 question

2014-06-19 Thread Pritchard, Howard P
Hi Ralph,

Thanks for the explanation.  Does ORTE/OMPI always assume that for multi-node
jobs there will only be one user's job per node?  At my previous employer we
had to make some changes to runtime components in order to support slurm, for
which the customers' default setting was to prefer filling nodes with jobs,
even if that meant multi-node jobs of different users were intermingled
within nodes.  The customers did not want to have to use the exclusive option.

Just a heads up, in case folks who are working on cray xe/xc systems are
assuming that the way things work now with aprun will hold true going forward.

Howard


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 18, 2014 5:00 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] r31916 question

You know, looking at the code and the comments, the rationale for putting the 
nids in order was to prep the list for the regex generator. If you look in the 
plm_ras_module, you'll see that we pass the nodelist to 
orte_plm_base_orted_append_basic_args. ORNL used static ports for alps to get 
better scaling, and so that function creates a regular expression from the 
nodelist. We then pass that to each orted upon launch so it can compute the URI 
for all other orteds in the system, thus allowing it to connect back to mpirun 
thru the routing tree (instead of making a direct connection).

HTH
Ralph

On Jun 18, 2014, at 3:55 PM, Ralph Castain <r...@open-mpi.org> wrote:


Ah, I see - yes, you'd use orte_get_attribute to retrieve it. Alternatively,
you have it sitting right there in an array, so you could just use the array
to order the list.


On Jun 18, 2014, at 3:47 PM, Pritchard, Howard P <howa...@lanl.gov> wrote:


Hi Ralph,

It is setting the attribute, but for some reason there seems to be a need to
have the node ids (nids) in ascending order, so there's some code looking at
the old launch_id field, which no longer exists.

I'm fixing it.  I'd like to learn the process for getting fixes into trunk.

Thanks,

Howard


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 18, 2014 4:45 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] r31916 question

Huh - thought I got that. Sorry I missed it. Let me take a look and ensure that 
the alps ras module is setting that attribute

On Jun 18, 2014, at 2:40 PM, Pritchard, Howard P <howa...@lanl.gov> wrote:



Hello Folks,

I'm looking at commit 31916 and notice that a lot of fields were removed from
orte_node_t.  This is now preventing ras_alps_module.c from compiling, owing
to its use of a "launch_id" field.

In lieu of the direct use of launch_id, should I replace the code around line
587 of this file with a call to orte_get_attribute, with ORTE_NODE_LAUNCH_ID
as the attribute to be retrieved?

Thanks,

Howard


-
Howard Pritchard
HPC-5
Los Alamos National Laboratory






Re: [OMPI devel] r31916 question

2014-06-18 Thread Ralph Castain
You know, looking at the code and the comments, the rationale for putting the 
nids in order was to prep the list for the regex generator. If you look in the 
plm_ras_module, you'll see that we pass the nodelist to 
orte_plm_base_orted_append_basic_args. ORNL used static ports for alps to get 
better scaling, and so that function creates a regular expression from the 
nodelist. We then pass that to each orted upon launch so it can compute the URI 
for all other orteds in the system, thus allowing it to connect back to mpirun 
thru the routing tree (instead of making a direct connection).
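
To illustrate why a sorted nodelist helps a generator of this kind, here is a minimal sketch in C that collapses an ascending list of nids into a compact range expression. It is not the actual ORTE regex generator and does not reproduce its output format; the function name demo_compress_nids and the "1-3,7" notation are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write a compact range expression for a sorted, duplicate-free nid list
 * into buf, e.g. {1,2,3,7,9,10} -> "1-3,7,9-10".
 * Returns the number of characters written, or -1 if buf is too small.
 * Hypothetical helper: the real ORTE generator uses its own format.
 */
static int demo_compress_nids(const int *nids, size_t count,
                              char *buf, size_t buflen)
{
    size_t used = 0;
    size_t i = 0;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

    while (i < count) {
        /* Extend the run while the next nid is exactly one larger. */
        size_t j = i;
        while (j + 1 < count && nids[j + 1] == nids[j] + 1) {
            j++;
        }

        int n;
        const char *sep = (used > 0) ? "," : "";
        if (j > i) {
            n = snprintf(buf + used, buflen - used, "%s%d-%d",
                         sep, nids[i], nids[j]);
        } else {
            n = snprintf(buf + used, buflen - used, "%s%d", sep, nids[i]);
        }
        if (n < 0 || (size_t)n >= buflen - used) {
            return -1;  /* output would be truncated, or encoding error */
        }
        used += (size_t)n;
        i = j + 1;
    }
    return (int)used;
}
```

With an unsorted list the runs would be broken up and the expression would grow, which is consistent with the rationale given above for ordering the nids first.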

HTH
Ralph

On Jun 18, 2014, at 3:55 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Ah, I see - yes, you'd use orte_get_attribute to retrieve it. Alternatively,
> you have it sitting right there in an array, so you could just use the array
> to order the list.
> 
> 
> On Jun 18, 2014, at 3:47 PM, Pritchard, Howard P <howa...@lanl.gov> wrote:
> 
>> Hi Ralph,
>>  
>> It is setting the attribute, but for some reason there seems to be a need to
>> have the node ids (nids) in ascending order, so there’s some code looking at
>> the old launch_id field, which no longer exists.
>>  
>> I’m fixing it.  I’d like to learn the process for getting fixes into trunk.
>>  
>> Thanks,
>>  
>> Howard
>>  
>>  
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Wednesday, June 18, 2014 4:45 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] r31916 question
>>  
>> Huh - thought I got that. Sorry I missed it. Let me take a look and ensure 
>> that the alps ras module is setting that attribute
>>  
>> On Jun 18, 2014, at 2:40 PM, Pritchard, Howard P <howa...@lanl.gov> wrote:
>> 
>> 
>> Hello Folks,
>>  
>> I’m looking at commit 31916 and notice that a lot of fields were removed from
>> orte_node_t.  This is now preventing ras_alps_module.c from compiling, owing
>> to its use of a “launch_id” field.
>>  
>> In lieu of the direct use of launch_id, should I replace the code around line
>> 587 of this file with a call to orte_get_attribute, with ORTE_NODE_LAUNCH_ID
>> as the attribute to be retrieved?
>>  
>> Thanks,
>>  
>> Howard
>>  
>>  
>> -
>> Howard Pritchard
>> HPC-5
>> Los Alamos National Laboratory
>>  
>>  
> 



Re: [OMPI devel] r31916 question

2014-06-18 Thread Pritchard, Howard P
Hi Ralph,

It is setting the attribute, but for some reason there seems to be a need to
have the node ids (nids) in ascending order, so there's some code looking at
the old launch_id field, which no longer exists.
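
As a rough illustration of the ordering step described here, the following C sketch sorts an array of node records by nid in ascending order using qsort. The demo_node_t type and the function names are hypothetical stand-ins, not the real orte_node_t or any code from ras_alps_module.c; in the actual module the id would presumably be obtained through the ORTE_NODE_LAUNCH_ID attribute rather than a struct field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a node record; not the real orte_node_t. */
typedef struct {
    int nid;            /* ALPS node id, playing the role of the launch id */
    const char *name;   /* node name, e.g. "nid00012" */
} demo_node_t;

/* qsort comparator: ascending by nid, written to avoid integer overflow. */
static int demo_node_cmp(const void *a, const void *b)
{
    const demo_node_t *na = (const demo_node_t *)a;
    const demo_node_t *nb = (const demo_node_t *)b;
    return (na->nid > nb->nid) - (na->nid < nb->nid);
}

/* Sort an array of nodes in place so that nids are in ascending order. */
static void demo_sort_nodes(demo_node_t *nodes, size_t count)
{
    assert(nodes != NULL || count == 0);
    if (count > 1) {
        qsort(nodes, count, sizeof(nodes[0]), demo_node_cmp);
    }
}
```

The comparator uses the difference of two comparisons instead of subtracting the ids, so it stays correct even for ids near the limits of int.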

I'm fixing it.  I'd like to learn the process for getting fixes into trunk.

Thanks,

Howard


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 18, 2014 4:45 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] r31916 question

Huh - thought I got that. Sorry I missed it. Let me take a look and ensure that 
the alps ras module is setting that attribute

On Jun 18, 2014, at 2:40 PM, Pritchard, Howard P <howa...@lanl.gov> wrote:


Hello Folks,

I'm looking at commit 31916 and notice that a lot of fields were removed from
orte_node_t.  This is now preventing ras_alps_module.c from compiling, owing
to its use of a "launch_id" field.

In lieu of the direct use of launch_id, should I replace the code around line
587 of this file with a call to orte_get_attribute, with ORTE_NODE_LAUNCH_ID
as the attribute to be retrieved?

Thanks,

Howard


-
Howard Pritchard
HPC-5
Los Alamos National Laboratory





Re: [OMPI devel] r31916 question

2014-06-18 Thread Ralph Castain
Huh - thought I got that. Sorry I missed it. Let me take a look and ensure that 
the alps ras module is setting that attribute

On Jun 18, 2014, at 2:40 PM, Pritchard, Howard P  wrote:

> Hello Folks,
>  
> I’m looking at commit 31916 and notice that a lot of fields were removed from
> orte_node_t.  This is now preventing ras_alps_module.c from compiling, owing
> to its use of a “launch_id” field.
>  
> In lieu of the direct use of launch_id, should I replace the code around line
> 587 of this file with a call to orte_get_attribute, with ORTE_NODE_LAUNCH_ID
> as the attribute to be retrieved?
>  
> Thanks,
>  
> Howard
>  
>  
> -
> Howard Pritchard
> HPC-5
> Los Alamos National Laboratory
>  
>  