Re: [OMPI devel] r31916 question
No, neither ORTE nor OMPI makes any such assumption. That's up to the scheduler. We will launch a separate orted for each job, though, to avoid cross-contamination.

On Jun 19, 2014, at 8:00 AM, Pritchard, Howard P <howa...@lanl.gov> wrote:

> Hi Ralph,
>
> Thanks for the explanation. Does ORTE/OMPI always assume that for multi-node jobs, there will only be one user’s job per node? At my previous employer we had to make some changes to runtime components in order to support slurm, for which the customers’ default setting was to prefer filling nodes with jobs even if that meant multi-node jobs of different users were intermingled within nodes. The customers did not want to have to use the exclusive option.
>
> Just a heads up in case folks who are working on cray xe/xc systems are assuming that the way things work now with aprun will hold true going forward.
>
> Howard
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15017.php
Re: [OMPI devel] r31916 question
The fault tolerance code also needs additional changes because of this commit. I have the changes prepared but not committed.
Re: [OMPI devel] r31916 question
Hi Ralph,

Thanks for the explanation. Does ORTE/OMPI always assume that for multi-node jobs, there will only be one user's job per node? At my previous employer we had to make some changes to runtime components in order to support slurm, for which the customers' default setting was to prefer filling nodes with jobs even if that meant multi-node jobs of different users were intermingled within nodes. The customers did not want to have to use the exclusive option.

Just a heads up in case folks who are working on cray xe/xc systems are assuming that the way things work now with aprun will hold true going forward.

Howard
Re: [OMPI devel] r31916 question
You know, looking at the code and the comments, the rationale for putting the nids in order was to prep the list for the regex generator. If you look in the plm_ras_module, you'll see that we pass the nodelist to orte_plm_base_orted_append_basic_args. ORNL used static ports for alps to get better scaling, and so that function creates a regular expression from the nodelist. We then pass that to each orted upon launch so it can compute the URI for all other orteds in the system, thus allowing it to connect back to mpirun thru the routing tree (instead of making a direct connection).

HTH
Ralph

On Jun 18, 2014, at 3:55 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Ah, I see - yes, you'd use get_attribute to retrieve it. Alternatively, you have it sitting right there in an array, so you could just use the array to order the list.
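To illustrate why an ordered nid list helps the regex generator described above, here is a minimal sketch in C. It is not ORTE's actual regex code: the function name nids_to_range_expr and the "nid[...]" output format are assumptions made purely for illustration. The sketch collapses runs of consecutive ids into ranges, so an ascending list yields a much shorter expression than the same ids out of order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of ORTE: compress a list of node ids into a
 * bracketed range expression such as "nid[12-15,20]". Runs are only detected
 * between adjacent entries, so the list should be in ascending order.
 * Returns the length of the expression, or -1 if the buffer is too small.
 */
static int nids_to_range_expr(const int *nids, size_t count,
                              char *out, size_t outlen)
{
    size_t used;
    int n;

    assert(out != NULL);
    assert(nids != NULL || count == 0);

    n = snprintf(out, outlen, "nid[");
    if (n < 0 || (size_t)n >= outlen) {
        return -1;
    }
    used = (size_t)n;

    for (size_t i = 0; i < count; ) {
        /* Extend the run while the next id is exactly one higher. */
        size_t j = i;
        while (j + 1 < count && nids[j + 1] == nids[j] + 1) {
            j++;
        }

        const char *sep = (i == 0) ? "" : ",";
        if (j > i) {
            n = snprintf(out + used, outlen - used, "%s%d-%d",
                         sep, nids[i], nids[j]);
        } else {
            n = snprintf(out + used, outlen - used, "%s%d", sep, nids[i]);
        }
        if (n < 0 || (size_t)n >= outlen - used) {
            return -1;
        }
        used += (size_t)n;
        i = j + 1;
    }

    n = snprintf(out + used, outlen - used, "]");
    if (n < 0 || (size_t)n >= outlen - used) {
        return -1;
    }
    return (int)strlen(out);
}
```

With the ids 12, 14, 13, 15 passed in that order, the sketch emits nid[12,14,13,15]; sorted, the same ids collapse to nid[12-15]. A shorter expression matters here because, per the explanation above, it is handed to every orted at launch.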
Re: [OMPI devel] r31916 question
Hi Ralph,

It is setting the attribute, but then for some reason there seems to be a need to have the node ids (nids) in ascending order, so there's some code looking at the old launch_id field, which no longer exists.

I'm fixing it. I'd like to learn the cycle of getting fixes into trunk.

Thanks,

Howard
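As a rough sketch of the kind of fix discussed here, the ordering code can read the launch id through an attribute lookup instead of a struct field. Everything in the block below is a stand-in written for illustration: sketch_node_t, sketch_get_attribute and SKETCH_NODE_LAUNCH_ID are hypothetical and only mimic the idea of orte_get_attribute with ORTE_NODE_LAUNCH_ID. The real signature and the real node container should be checked against the ORTE headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in types for illustration only; these are NOT the ORTE definitions. */
enum { SKETCH_NODE_LAUNCH_ID = 1 };

typedef struct {
    int     key;
    int32_t value;
} sketch_attr_t;

typedef struct {
    const char   *name;
    sketch_attr_t attrs[4];
    size_t        num_attrs;
} sketch_node_t;

/* Look up an attribute by key; returns false if the node does not carry it. */
static bool sketch_get_attribute(const sketch_node_t *node, int key,
                                 int32_t *value)
{
    for (size_t i = 0; i < node->num_attrs; i++) {
        if (node->attrs[i].key == key) {
            *value = node->attrs[i].value;
            return true;
        }
    }
    return false;
}

/* qsort comparator: ascending launch id; nodes lacking the attribute go last. */
static int compare_by_launch_id(const void *a, const void *b)
{
    const sketch_node_t *na = a;
    const sketch_node_t *nb = b;
    int32_t ida = 0, idb = 0;
    bool has_a = sketch_get_attribute(na, SKETCH_NODE_LAUNCH_ID, &ida);
    bool has_b = sketch_get_attribute(nb, SKETCH_NODE_LAUNCH_ID, &idb);

    if (has_a != has_b) {
        return has_a ? -1 : 1;
    }
    return (ida > idb) - (ida < idb);
}

/* Put the nodes in ascending launch-id (nid) order. */
static void sort_nodes_by_launch_id(sketch_node_t *nodes, size_t count)
{
    if (count < 2) {
        return;
    }
    assert(nodes != NULL);
    qsort(nodes, count, sizeof nodes[0], compare_by_launch_id);
}
```

The sketch sorts a plain array for brevity. Handling a missing attribute explicitly is a design choice of this example: once the id is no longer a struct field, the lookup can fail, and the caller has to decide where such nodes belong.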
Re: [OMPI devel] r31916 question
Huh - thought I got that. Sorry I missed it. Let me take a look and ensure that the alps ras module is setting that attribute.

On Jun 18, 2014, at 2:40 PM, Pritchard, Howard P wrote:

> Hello Folks,
>
> I’m looking at commit 31916 and notice a lot of fields were removed from orte_node_t. This is now preventing ras_alps_module.c from compiling owing to use of a “launch_id” field.
>
> In lieu of the direct use of launch_id, should I replace the code around line 587 of this file with use of orte_get_attribute, with ORTE_NODE_LAUNCH_ID as the attribute to be retrieved?
>
> Thanks,
>
> Howard
>
> -
> Howard Pritchard
> HPC-5
> Los Alamos National Laboratory