[OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
Hi all, Not sure if this is an OpenMPI query or a PLPA query, but given that PLPA seems to have some support for it already I thought I'd start here. :-) We run a quad-core Opteron cluster with Torque 2.3.x, which uses the kernel's cpuset support to constrain a job to just the cores it has been allocated. However, we are occasionally seeing that where a job has been allocated multiple cores on the same node we get two compute-bound MPI processes in the job scheduled onto the same core (obviously a kernel issue). So CPU affinity would be an obvious solution, but it needs to be set with reference to the cores that are available to the job in its cpuset. This information is already retrievable by PLPA (for instance "plpa-taskset -cp $$" will report the cores allocated to the shell you run the command from), but I'm not sure whether OpenMPI makes use of this when binding CPUs using the linux paffinity MCA parameter. Our testing (with 1.3.2) seems to show it doesn't, and I don't think there are any significant differences with the snapshots in 1.4. Am I correct in this? If so, are there any plans to make it do this? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
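For illustration, here is a minimal sketch (plain Linux calls, not Open MPI or PLPA code) of how a process can read the cpuset-constrained affinity mask that "plpa-taskset -cp $$" reports; this is the information Open MPI would need to consult when choosing cores to bind to:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        int i;

        /* Ask the kernel which cores this process may run on.  Inside a
         * Torque cpuset this returns only the cores the job was given. */
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        printf("allowed cores:");
        for (i = 0; i < CPU_SETSIZE; i++) {
            if (CPU_ISSET(i, &mask)) {
                printf(" %d", i);
            }
        }
        printf("\n");
        return 0;
    }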
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
Hi Eugene, The FAQ page looks very good! Some links on the left side do not work, but I assume they will work tomorrow, when the real page goes live. Thanks, Nik Eugene Loh wrote: Zou, Lin (GE, Research, Consultant) wrote: Hi all, I want to trace my program and have used VampirTrace to generate tracing info; apart from Vampir, where can I download free tools to parse the tracing info? Thanks in advance. Lin This message appeared on the users list yesterday. For a long time, I've been meaning to add a perf-tool section to the FAQ. I finally did so, incorporating questions and answers from the users and devel lists that I've seen on this subject in the last few months. I just put the changes back and as soon as I see the pages "live" I'll respond to the user on the user list. Please take a look. You can make changes as you like or give me feedback and I can do it. I acknowledge that there is a conflict of interest in my recommending Sun MPI Analyzer, but I believe I've done so tastefully and appropriately! Throw tomatoes if you see fit. P.S. Until the page goes live, I'll also leave it at http://www.osl.iu.edu/~eloh/faq/?category=perftools . Or, check out a workspace. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
Hi Eugene, the FAQ page looks very nice. I just sent the following answer to Lin Zou: . for a quick view of what is inside the trace you could try 'otfprofile' to generate a tex/ps file with some information. This tool is a component of the latest stand-alone version of the Open Trace Format (OTF) - see http://www.tu-dresden.de/zih/otf/. However, if you need more detailed information about the trace you would need to get an evaluation version of Vampir - see http://www.vampir.eu. In addition to the evaluation version of Vampir, a free version with some functional limitations will be available in the near future. . Could you also mention the tool 'otfprofile' under section 7, please? As soon as the free version of Vampir is available this could also be mentioned. Thanks, Matthias On Tue, 2009-07-14 at 18:54 -0700, Eugene Loh wrote: > Zou, Lin (GE, Research, Consultant) wrote: > > Hi all, > > I want to trace my program, having used vampirTrace to generate > > tracing info, except for Vampir, where can I download free tools to > > parse the tracing info? > > Thanks in advance. > > Lin > This message appeared on the users list yesterday. For a long time, > I've been meaning to add a perf-tool section to the FAQ. I finally > did so, incorporating questions and answers from the users and devel > lists that I've seen on this subject in the last few months. I just > put the changes back and as soon as I see the pages "live" I'll > respond to the user on the user list. Please take a look. You can > make changes as you like or give me feedback and I can do it. > > I acknowledge that there is a conflict of interests in my recommending > Sun MPI Analyzer, but I believe I've done so tastefully and > appropriately! Throw tomatoes if you see fit. > > P.S. Until the page goes live, I'll also leave it at > http://www.osl.iu.edu/~eloh/faq/?category=perftools . Or, check out a > workspace. > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Matthias Jurenz Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) 01062 Dresden, Germany Phone : (+49) 351/463-31945 Fax : (+49) 351/463-37773 e-mail: matthias.jur...@tu-dresden.de WWW : http://www.tu-dresden.de/zih
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
Interesting. No, we don't take PLPA cpu sets into account when retrieving the allocation. Just to be clear: from an OMPI perspective, I don't think this is an issue of binding, but rather an issue of allocation. If we knew we had been allocated only a certain number of cores on a node, then we would only map that many procs to the node. When we subsequently "bind", we should then bind those procs to the correct cores (I think). Could you check this? You can run a trivial job using the -npernode x option, where x matched the #cores you were allocated on the nodes. If you do this, do we bind to the correct cores? If we do, then that would confirm that we just aren't picking up the right number of cores allocated to us. If it is wrong, then this is a PLPA issue where it isn't binding to the right core. Thanks Ralph On Jul 15, 2009, at 12:28 AM, Chris Samuel wrote: Hi all, Not sure if this is a OpenMPI query or a PLPA query, but given that PLPA seems to have some support for it already I thought I'd start here. :-) We run a quad core Opteron cluster with Torque 2.3.x which uses the kernels cpuset support to constrain a job to just the cores it has been allocated. However, we are seeing occasionally that where a job has been allocated multiple cores on the same node we get two compute bound MPI processes in the job scheduled onto the same core (obviously a kernel issue). So CPU affinity would be an obvious solution, but it needs to be done with reference to the cores that are available to it in its cpuset. This information is already retrievable by PLPA (for instance "plpa-taskset -cp $$" will retrieve the cores allocated to the shell you run the command from) but I'm not sure if OpenMPI makes use of this when binding CPUs using the linux paffinity MCA parameter ? Our testing (with 1.3.2) seems to show it doesn't, and I don't think there are any significant differences with the snapshots in 1.4. Am I correct in this ? If so, are there any plans to make it do this ? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
On Jul 15, 2009, at 6:17 AM, Matthias Jurenz wrote: the FAQ page looks very nice. Ditto -- thanks for doing it, Eugene! I just sent the following answer to Lin Zou: Did that go on-list? It would be good to see that stuff in the publicly-searchable web archives. I mention this because our Google Analytics clearly show that lots of people are searching our mailing list, looking for answers to their questions. Could you also mention the tool 'otfprofile' under the section 7, please? As soon as the free version of Vampir is available this could also be mentioned. Do you guys not have write access to the SVN repo for the web pages? If not, we should just add you -- that would certainly make it easier... -- Jeff Squyres Cisco Systems
Re: [OMPI devel] [RFC] Move the datatype engine in the OPAL layer
On Jul 14, 2009, at 1:23 PM, Rainer Keller wrote: https://svn.open-mpi.org/trac/ompi/wiki/HowtoTesting That is most helpful -- thanks! What about the latency issue? > >> Performance tests on the ompi-ddt branch have proven that there is no > >> performance penalties associated with this change (tests done using > >> NetPipe-3.7.1 on smoky using BTL/sm, giving 1.6usecs on this > >> platform). > > > > 1.6us sounds like pretty high sm latency... Is this a slow platform? -- Jeff Squyres Cisco Systems
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
- "Ralph Castain" wrote: Hi Ralph, > Interesting. No, we don't take PLPA cpu sets into account when > retrieving the allocation. Understood. > Just to be clear: from an OMPI perspective, I don't think this is an > issue of binding, but rather an issue of allocation. If we knew we had > been allocated only a certain number of cores on a node, then we would > only map that many procs to the node. When we subsequently "bind", we > should then bind those procs to the correct cores (I think). Hmm, OpenMPI should already know this from the PBS TM API when launching the job, we've never had to get our users to specify how many procs per node to start (and they will generally have no idea how many to ask for in advance as they are at the mercy of the scheduler, unless they select a whole nodes with ppn=8). > Could you check this? You can run a trivial job using the > -npernode x option, where x matched the #cores you were > allocated on the nodes. > > If you do this, do we bind to the correct cores? I'll give this a shot tomorrow when I'm back in the office (just checking email late at night here), I'll try it under strace to to see what it tries to sched_setaffinity() to. > If we do, then that would confirm that we just aren't > picking up the right number of cores allocated to us. > If it is wrong, then this is a PLPA issue where it > isn't binding to the right core. Interesting, will let you know the test results tomorrow! cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
On Tue, 2009-07-14 at 18:54 -0700, Eugene Loh wrote: > P.S. Until the page goes live, I'll also leave it at > http://www.osl.iu.edu/~eloh/faq/?category=perftools . Or, check out a > workspace. I'm happy with it. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
Hi Jeff, On Wed, 2009-07-15 at 07:13 -0400, Jeff Squyres wrote: > On Jul 15, 2009, at 6:17 AM, Matthias Jurenz wrote: > > > the FAQ page looks very nice. > > > > Ditto -- thanks for doing it, Eugene! > > > I just sent the following answer to Lin Zou: > > > > Did that go on-list? It would be good to see that stuff in the > publicly-searchable web archives. I mention this because our Google > Analytics clearly show that lots of people are searching our mailing > list, looking for answers to their questions. > I sent the answer directly to the user, 'cause I didn't subscribe to the user-list. I'll do that asap ;-) > > Could you also mention the tool 'otfprofile' under the section 7, > > please? As soon as the free version of Vampir is available this could > > also be mentioned. > > > > > Do you guys not have write access to the SVN repo for the web pages? > If not, we should just add you -- that would certainly make it easier... > Unfortunately, we don't have write access to the repository for the web pages. Could you add us (me and Andreas), please? Thanks, Matthias
[OMPI devel] selectively bind MPI to one HCA out of available ones
Hi all, I have a cluster where both HCAs of each blade are active, but connected to different subnets. Is there an option in MPI to select one HCA out of the available ones? I know it can be done by making changes in the openmpi code, but I need a clean interface, such as an option at MPI launch time, to select mthca0 or mthca1. Any help is appreciated. BTW, I just checked MVAPICH and this feature is available there. Regards Neeraj Chourasia (MTS) Computational Research Laboratories Ltd. (A wholly Owned Subsidiary of TATA SONS Ltd) B-101, ICC Trade Towers, Senapati Bapat Road Pune 411016 (Mah) INDIA (O) +91-20-6620 9863 (Fax) +91-20-6620 9862 M: +91.9225520634 =-=-= Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. The sender does not accept liability for any errors or omissions. Thank you =-=-=
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
Hmmm...I believe I made a mis-statement. Shocking to those who know me, I am sure! :-) Just to correct my comments: OMPI knows how many "slots" have been allocated to us, but not which "cores". So I'll assign the correct number of procs to each node, but they won't know that we were allocated cores 2 and 4 (for example), as opposed to some other combination. When we subsequently bind, we bind to logical cpus based on our node rank - i.e., what rank I am relative to my local peers on this node. PLPA then translates that into a physical core. My guess is that you are correct and PLPA isn't looking to see specifically -which- cores were allocated to the job, but instead is simply translating logical cpu=0 to the first physical core in the node. The test I asked you to run, though, will confirm this. Please do let us know as this is definitely something we should fix. Thanks! Ralph On Wed, Jul 15, 2009 at 6:11 AM, Chris Samuel wrote: > > - "Ralph Castain" wrote: > > Hi Ralph, > > > Interesting. No, we don't take PLPA cpu sets into account when > > retrieving the allocation. > > Understood. > > > Just to be clear: from an OMPI perspective, I don't think this is an > > issue of binding, but rather an issue of allocation. If we knew we had > > been allocated only a certain number of cores on a node, then we would > > only map that many procs to the node. When we subsequently "bind", we > > should then bind those procs to the correct cores (I think). > > Hmm, OpenMPI should already know this from the PBS TM API when > launching the job, we've never had to get our users to specify > how many procs per node to start (and they will generally have > no idea how many to ask for in advance as they are at the mercy > of the scheduler, unless they select a whole nodes with ppn=8). > > > Could you check this? You can run a trivial job using the > > -npernode x option, where x matched the #cores you were > > allocated on the nodes. > > > > If you do this, do we bind to the correct cores? > > I'll give this a shot tomorrow when I'm back in the office > (just checking email late at night here), I'll try it under > strace to to see what it tries to sched_setaffinity() to. > > > If we do, then that would confirm that we just aren't > > picking up the right number of cores allocated to us. > > If it is wrong, then this is a PLPA issue where it > > isn't binding to the right core. > > Interesting, will let you know the test results tomorrow! > > cheers, > Chris > -- > Christopher Samuel - (03) 9925 4751 - Systems Manager > The Victorian Partnership for Advanced Computing > P.O. Box 201, Carlton South, VIC 3053, Australia > VPAC is a not-for-profit Registered Research Agency > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
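To make the distinction concrete, here is a toy sketch (illustrative only, not the PLPA or Open MPI implementation) of the behaviour being discussed: rather than binding node-local rank N to absolute physical core N, the process would first read its allowed set (the cpuset) and bind to the N-th core within that set. The node_rank argument and helper name below are hypothetical stand-ins, not real Open MPI symbols:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Bind the calling process to the 'node_rank'-th core that the current
     * affinity mask (e.g. a Torque cpuset) allows, instead of to absolute
     * physical core 'node_rank'.  Returns 0 on success, -1 on error. */
    static int bind_within_allowed_set(int node_rank)
    {
        cpu_set_t allowed, target;
        int core, seen = 0;

        if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
            return -1;

        for (core = 0; core < CPU_SETSIZE; core++) {
            if (!CPU_ISSET(core, &allowed))
                continue;
            if (seen++ == node_rank) {
                CPU_ZERO(&target);
                CPU_SET(core, &target);
                return sched_setaffinity(0, sizeof(target), &target);
            }
        }
        return -1;   /* more local ranks than allowed cores */
    }

    int main(void)
    {
        /* node_rank would normally come from the launcher; 0 is just a demo. */
        if (bind_within_allowed_set(0) != 0)
            perror("bind_within_allowed_set");
        return 0;
    }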
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
On Jul 15, 2009, at 8:57 AM, Matthias Jurenz wrote: I sent the answer directly to the user, 'cause I didn't subscribe to the user-list. I'll do that asap ;-) Thanks -- I appreciate it. I know it's a somewhat high-volume list. I can bounce you the original question so that you can reply to it and have it threaded properly. > > Could you also mention the tool 'otfprofile' under the section 7, > > please? As soon as the free version of Vampir is available this could > > also be mentioned. > > Do you guys not have write access to the SVN repo for the web pages? > If not, we should just add you -- that would certainly make it easier... Unfortunately, we haven't write access to the repository for the web pages. Could you add us (me and Andreas), please? Will do -- can you remind me of your SVN ID's again? -- Jeff Squyres Cisco Systems
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
On Jul 15, 2009, at 10:24 AM, Jeff Squyres (jsquyres) wrote: Thanks -- I appreciate it. I know it's a somewhat high-volume list. I can bounce you the original question so that you can reply to it and have it threaded properly. Disregard -- you replied already. Many thanks! -- Jeff Squyres Cisco Systems
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
On Wed, 2009-07-15 at 10:24 -0400, Jeff Squyres wrote: > On Jul 15, 2009, at 8:57 AM, Matthias Jurenz wrote: > > > I sent the answer directly to the user, 'cause I didn't subscribe to > > the > > user-list. I'll do that asap ;-) > > > > Thanks -- I appreciate it. I know it's a somewhat high-volume list. > I can bounce you the original question so that you can reply to it and > have it threaded properly. > > > > > Could you also mention the tool 'otfprofile' under the section 7, > > > > please? As soon as the free version of Vampir is available this > > could > > > > also be mentioned. > > > > > > Do you guys not have write access to the SVN repo for the web pages? > > > If not, we should just add you -- that would certainly make it > > easier... > > > > Unfortunately, we haven't write access to the repository for the web > > pages. Could you add us (me and Andreas), please? > > > > > Will do -- can you remind me of your SVN ID's again? > Sure! Our SVN ID's are: jurenz and knuepfer
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
On Jul 15, 2009, at 10:37 AM, Matthias Jurenz wrote: Sure! Our SVN ID's are: jurenz and knuepfer Done! You should have write access -- let me know if you don't. I think you guys have seen it before, but here's the wiki page about adding / editing wiki pages: https://svn.open-mpi.org/trac/ompi/wiki/OMPIFAQEntries Eugene recently added a bunch of good stuff in there. -- Jeff Squyres Cisco Systems
[OMPI devel] Fwd: [all-osl-users] Upgrading of the OSL SVN server
FYI. Begin forwarded message: From: "DongInn Kim" Date: July 15, 2009 10:39:01 AM EDT To: Subject: Re: [all-osl-users] Upgrading of the OSL SVN server I am sorry that we can not upgrade subversion this time because of the technical issues on the interaction between the new subversion and SourceHaven web application. Once this issue is cleared, I will make another schedule to upgrade subversion. Until then, we will use the old version(ver 1.4.2) of subversion like we did before. The servers are up and running with subversion-1.4.2 now. Best Regards, - DongInn On 7/13/09 10:20 AM, Kim, DongInn wrote: > Hi, > > The new version(1.6.3) of subversion was released on June 2009. It has a lot of good features included and many bugs are fixed. > http://subversion.tigris.org/servlets/ProjectNewsList > > The OSL would like to upgrade the current subversion(1.4.2) to get the benefit of the new version. > The upgrade would start at 8:00AM(E.T.) on July 15, 2009. > > The subversion service and trac websites services would NOT be available during the following time period. > - 5:00am-11:00am Pacific US time > - 6:00am-12:00pm Mountain US time > - 7:00am-1:00pm Central US time > - 8:00am-2:00pm Eastern US time > - 12:00pm-6:00pm GMT > > Please let me know if you have any concerns or questions about this upgrade. > > Regards, > -- Jeff Squyres Cisco Systems
[OMPI devel] Fwd: [all-osl-users] Upgrading of the OSL SVN server
FYI. Begin forwarded message: From: DongInn Kim Date: July 15, 2009 10:39:01 AM EDT To: all-osl-us...@osl.iu.edu Subject: Re: [all-osl-users] Upgrading of the OSL SVN server I am sorry that we can not upgrade subversion this time because of the technical issues on the interaction between the new subversion and SourceHaven web application. Once this issue is cleared, I will make another schedule to upgrade subversion. Until then, we will use the old version(ver 1.4.2) of subversion like we did before. The servers are up and running with subversion-1.4.2 now. Best Regards, - DongInn On 7/13/09 10:20 AM, Kim, DongInn wrote: Hi, The new version(1.6.3) of subversion was released on June 2009. It has a lot of good features included and many bugs are fixed. http://subversion.tigris.org/servlets/ProjectNewsList The OSL would like to upgrade the current subversion(1.4.2) to get the benefit of the new version. The upgrade would start at 8:00AM(E.T.) on July 15, 2009. The subversion service and trac websites services would NOT be available during the following time period. - 5:00am-11:00am Pacific US time - 6:00am-12:00pm Mountain US time - 7:00am-1:00pm Central US time - 8:00am-2:00pm Eastern US time - 12:00pm-6:00pm GMT Please let me know if you have any concerns or questions about this upgrade. Regards,
Re: [OMPI devel] Fwd: [all-osl-users] Upgrading of the OSL SVN server
*Quickness competition round 1* Jeff vs. Josh 1 : 0 ;-)) Josh Hursey wrote: > FYI. > > > Begin forwarded message: > >> From: DongInn Kim >> Date: July 15, 2009 10:39:01 AM EDT >> To: all-osl-us...@osl.iu.edu >> Subject: Re: [all-osl-users] Upgrading of the OSL SVN server >> >> I am sorry that we can not upgrade subversion this time because of the >> technical issues on the interaction between the new subversion and >> SourceHaven web application. >> >> Once this issue is cleared, I will make another schedule to upgrade >> subversion. Until then, we will use the old version(ver 1.4.2) of >> subversion like we did before. >> >> The servers are up and running with subversion-1.4.2 now. >> >> Best Regards, >> >> - DongInn >> >> On 7/13/09 10:20 AM, Kim, DongInn wrote: >>> Hi, >>> >>> The new version(1.6.3) of subversion was released on June 2009. It >>> has a lot of good features included and many bugs are fixed. >>> http://subversion.tigris.org/servlets/ProjectNewsList >>> >>> The OSL would like to upgrade the current subversion(1.4.2) to get >>> the benefit of the new version. >>> The upgrade would start at 8:00AM(E.T.) on July 15, 2009. >>> >>> The subversion service and trac websites services would NOT be >>> available during the following time period. >>> - 5:00am-11:00am Pacific US time >>> - 6:00am-12:00pm Mountain US time >>> - 7:00am-1:00pm Central US time >>> - 8:00am-2:00pm Eastern US time >>> - 12:00pm-6:00pm GMT >>> >>> Please let me know if you have any concerns or questions about this >>> upgrade. >>> >>> Regards, >>> > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Holger Mickler Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) 01062 Dresden Germany Contact Room: Willers-Bau A306 Phone: +49 351 463-37903 Fax:+49 351 463-37773 email: holger.mick...@tu-dresden.de
[OMPI devel] DDT and spawn issue?
I [very briefly] read about the DDT spawn issues, so I went to look at ompi/op/op.c. I notice that there's a new comment above the op datatype<-->op map construction area that says: /* XXX TODO */ svn blame says: 21641 rusraink /* XXX TODO */ r21641 is the big merge from the past weekend where the DDT split came in. Has this area been looked at and the comment is out of date? Or does it need to be updated with new mappings? (I honestly have not looked any farther than this -- the new comment caught my eye) -- Jeff Squyres Cisco Systems
Re: [OMPI devel] DDT and spawn issue?
Yes, this appears to be at least part of the problem Edgar is seeing. We're trying to figure out how most of the tests passed so far with a wrong mapping. Interestingly enough, while the mapping seems wrong, the lookup is symmetric, so most of the time we end up with the correct op by pure luck. We're looking into this. george. On Jul 15, 2009, at 11:50 , Jeff Squyres wrote: I [very briefly] read about the DDT spawn issues, so I went to look at ompi/op/op.c. I notice that there's a new comment above the op datatype<-->op map construction area that says: /* XXX TODO */ svn blame says: 21641 rusraink /* XXX TODO */ r21641 is the big merge from the past weekend where the DDT split came in. Has this area been looked at and the comment is out of date? Or does it need to be updated with new mappings? (I honestly have not looked any farther than this -- the new comment caught my eye) -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
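A toy illustration (purely hypothetical; this is not the actual ompi/op/op.c code or its tables) of why a wrong but symmetric mapping can still return the right op: if the same incorrect index translation is applied both when the table is filled and when it is read back, the two mistakes cancel out.

    #include <stdio.h>

    /* Toy example: the same wrong datatype->index mapping used at store
     * and at lookup time still yields the function that was stored. */

    #define NUM_TYPES 4

    typedef int (*reduce_fn)(int, int);

    static int op_sum(int a, int b) { return a + b; }
    static int op_max(int a, int b) { return a > b ? a : b; }

    /* Suppose the "real" index of a datatype is i, but a buggy translation
     * maps it to (i + 1) % NUM_TYPES. */
    static int buggy_index(int i) { return (i + 1) % NUM_TYPES; }

    int main(void)
    {
        reduce_fn table[NUM_TYPES] = { 0 };

        /* Store through the buggy mapping ... */
        table[buggy_index(0)] = op_sum;
        table[buggy_index(1)] = op_max;

        /* ... and look up through the same buggy mapping: still correct. */
        printf("sum(2,3) = %d\n", table[buggy_index(0)](2, 3));  /* prints 5 */
        printf("max(2,3) = %d\n", table[buggy_index(1)](2, 3));  /* prints 3 */
        return 0;
    }

It only breaks when the two sides disagree, which matches the "works by pure luck" observation above.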
Re: [OMPI devel] DDT and spawn issue?
Thanks George!! On Wed, Jul 15, 2009 at 9:57 AM, George Bosilca wrote: > Yes, this appears to be at least partially part of the problem Edgar is > seeing. We're trying to figure out how most of the tests passed so far with > a wrong mapping. Interesting enough, while the mapping seems wrong the > lookup is symmetric so most of the time we end-up with the correct op by > pure luck. > > We're looking into this. > > george. > > > On Jul 15, 2009, at 11:50 , Jeff Squyres wrote: > > I [very briefly] read about the DDT spawn issues, so I went to look at >> ompi/op/op.c. I notice that there's a new comment above the op >> datatype<-->op map construction area that says: >> >> /* XXX TODO */ >> >> svn blame says: >> >> 21641 rusraink /* XXX TODO */ >> >> r21641 is the big merge from the past weekend where the DDT split came in. >> >> Has this area been looked at and the comment is out of date? Or does it >> need to be updated with new mappings? (I honestly have not looked any >> farther than this -- the new comment caught my eye) >> >> -- >> Jeff Squyres >> Cisco Systems >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
Re: [OMPI devel] DDT and spawn issue?
Hi Jeff, Ralph and Edgar forwarded an email about this. We (George and myself) are currently looking into it. With the changes we have, I can get IBM/spawn to work "sometimes", i.e. sometimes it segfaults. Thanks, Rainer On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote: > I [very briefly] read about the DDT spawn issues, so I went to look at > ompi/op/op.c. I notice that there's a new comment above the op > datatype<-->op map construction area that says: > > /* XXX TODO */ > > svn blame says: > > 21641 rusraink /* XXX TODO */ > > r21641 is the big merge from the past weekend where the DDT split came > in. > > Has this area been looked at and the comment is out of date? Or does > it need to be updated with new mappings? (I honestly have not looked > any farther than this -- the new comment caught my eye) -- Rainer Keller, PhD Tel: +1 (865) 241-6293 Oak Ridge National Lab Fax: +1 (865) 241-4811 PO Box 2008 MS 6164 Email: kel...@ornl.gov Oak Ridge, TN 37831-2008 AIM/Skype: rusraink
Re: [OMPI devel] [OMPI users] where can i get a tracing tool
Done. Hit "reload" on the URL below, check out an SVN repository, or wait for these changes to be pushed to the live site. Matthias Jurenz wrote: Could you also mention the tool 'otfprofile' under the section 7, please? On Tue, 2009-07-14 at 18:54 -0700, Eugene Loh wrote: P.S. Until the page goes live, I'll also leave it at http://www.osl.iu.edu/~eloh/faq/?category=perftools .
Re: [OMPI devel] DDT and spawn issue?
Perhaps we should add a requirement for testing on 2-3 different systems before long-term (or "big change") branches like this come to the trunk? I say this because it seems like at least some of these problems were based on bad luck -- i.e., the stuff worked on the platform that it was being tested and developed on, even though there are bugs left. Having fallen victim to this myself many times ("worked for me on Cisco machines! I dunno why it's failing for you... :-("), I think we all recognize the value of just running the same code on someone else's systems -- it has a good tendency to turn up issues that don't show up on yours. I'm not trying to say that every little trunk commit needs to be validated -- but "big" changes like this could certainly benefit from multiple validations. Cisco is very willing to be a 2nd platform for testing for stuff that we can run without too much trouble, especially via MTT (e.g., I already have the right kind of networks to test, etc.). BTW, is anyone going to comment about the latency issue that I asked about? (in case you can't tell, I'm moderately displeased about how this whole branch came to the trunk... :-\ ) On Jul 15, 2009, at 12:04 PM, Rainer Keller wrote: Hi Jeff, Ralph and Edgar send fwd an email about this. We (George and myselve) are currently looking into this. With the changes we have I can get IBM/spawn to work "sometimes", aka sometimes, it segfaults. Thanks, Rainer On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote: > I [very briefly] read about the DDT spawn issues, so I went to look at > ompi/op/op.c. I notice that there's a new comment above the op > datatype<-->op map construction area that says: > > /* XXX TODO */ > > svn blame says: > > 21641 rusraink /* XXX TODO */ > > r21641 is the big merge from the past weekend where the DDT split came > in. > > Has this area been looked at and the comment is out of date? Or does > it need to be updated with new mappings? (I honestly have not looked > any farther than this -- the new comment caught my eye) -- Rainer Keller, PhD Tel: +1 (865) 241-6293 Oak Ridge National Lab Fax: +1 (865) 241-4811 PO Box 2008 MS 6164 Email: kel...@ornl.gov Oak Ridge, TN 37831-2008AIM/Skype: rusraink -- Jeff Squyres Cisco Systems
[OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686
I have a question regarding the mapping. How can I declare a partial mapping? In fact I only care about how some of the processes are mapped on some specific nodes. Right now, if the rmaps file doesn't contain information about all nodes, we give up (before this patch we segfaulted). Does it mean we always have to declare the whole mapping, or is it just that we overlooked this strange case? george. Begin forwarded message: Author: bosilca Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009) New Revision: 21686 URL: https://svn.open-mpi.org/trac/ompi/changeset/21686 Log: Reorder the nidmap encoding function. Add a check to make sure we don't write outside the boundaries of the allocated array. However, the problem is still there. If we have rmaps file containing only partial information the num_procs get set to the wrong value (the number of hosts in the rmaps file instead of the number of processes requested on the command line).
Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686
The routed comm system relies on each daemon having complete information as to where every process is located, so the expectation was that only full maps would ever be sent. Thus, the nidmap code is setup to always send a full map. I don't know how to even generate a "partial" map. I assume you are doing something offline? Is this to update changed info? If so, you'll also have to do something to update the daemon's maps or the comm system will break down. Ralph On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca wrote: > I have a question regarding the mapping. How can I declare a partial > mapping ? In fact I only care about how some of the processes are mapped on > some specific nodes. Right now if the rmaps doesn't contain information > about all nodes, we give up (before this patch we segfaulted). > > Does it means we always have to declare the whole mapping or it's just that > we overlooked this strange case? > > george. > > Begin forwarded message: > > Author: bosilca >> Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009) >> New Revision: 21686 >> URL: https://svn.open-mpi.org/trac/ompi/changeset/21686 >> >> Log: >> Reorder the nidmap encoding function. Add a check to make sure we don't >> write >> outside the boundaries of the allocated array. >> >> However, the problem is still there. If we have rmaps file containing only >> partial information the num_procs get set to the wrong value (the number >> of >> hosts in the rmaps file instead of the number of processes requested on >> the >> command line). >> > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
Re: [OMPI devel] DDT and spawn issue?
Actually I don't think this will help. I looked on MTT and there are no errors related to this (logically all reductions should have failed) ... and MTT is supposed to run on several platforms. What happens inside is really strange, but as we make the same mistake when we look up the op as when we store it, this works in most cases. Moreover, even with the op corrected we still see segfaults, and it looks more and more like a memory overwrite problem... Before the commit we even tested it on a SiCortex machine (which is clearly a different architecture from x86_64), and this didn't trigger any errors either. Regarding the latency issue, there is not much to say. The platform we tested on is clearly older than what other people test on, but that is all there is to it. The two versions (before and after the datatype move) have the same latency, so there is no reason to focus on the latency number. george. On Jul 15, 2009, at 12:18 , Jeff Squyres wrote: Perhaps we should add a requirement for testing on 2-3 different systems before long-term (or "big change") branches like this come to the trunk? I say this because it seems like at least some of these problems were based on bad luck -- i.e., the stuff worked on the platform that it was being tested and developed on, even though there are bugs left. Having fallen victim to this myself many times ("worked for me on Cisco machines! I dunno why it's failing for you... :-("), I think we all recognize the value of just running the same code on someone else's systems -- it has a good tendency to turn up issues that don't show up on yours. I'm not trying to say that every little trunk commit needs to be validated -- but "big" changes like this could certainly benefit from multiple validations. Cisco is very willing to be a 2nd platform for testing for stuff that we can run without too much trouble, especially via MTT (e.g., I already have the right kind of networks to test, etc.). BTW, is anyone going to comment about the latency issue that I asked about? (in case you can't tell, I'm moderately displeased about how this whole branch came to the trunk... :-\ ) On Jul 15, 2009, at 12:04 PM, Rainer Keller wrote: Hi Jeff, Ralph and Edgar send fwd an email about this. We (George and myselve) are currently looking into this. With the changes we have I can get IBM/spawn to work "sometimes", aka sometimes, it segfaults. Thanks, Rainer On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote: > I [very briefly] read about the DDT spawn issues, so I went to look at > ompi/op/op.c. I notice that there's a new comment above the op > datatype<-->op map construction area that says: > > /* XXX TODO */ > > svn blame says: > > 21641 rusraink /* XXX TODO */ > > r21641 is the big merge from the past weekend where the DDT split came > in. > > Has this area been looked at and the comment is out of date? Or does > it need to be updated with new mappings? (I honestly have not looked > any farther than this -- the new comment caught my eye) -- Rainer Keller, PhD Tel: +1 (865) 241-6293 Oak Ridge National Lab Fax: +1 (865) 241-4811 PO Box 2008 MS 6164 Email: kel...@ornl.gov Oak Ridge, TN 37831-2008 AIM/Skype: rusraink -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686
I think I found a better solution (in r21688). Here is what I was trying to do. I have a more or less homogeneous cluster. In fact all processors are identical, except that some are quad core and some dual core. Of course I care how my processes are mapped on the quad cores, but not really on the dual cores. My approach was to use the following configuration files. In /home/bosilca/.openmpi/mca-params.conf I have: orte_default_hostfile=/home/bosilca/.openmpi/machinefile rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile rmaps_rank_file_priority = 100 In /home/bosilca/.openmpi/machinefile I have the full description of the cluster. As an example: node01 slots=4 node02 slots=4 node03 slots=2 node04 slots=2 And in the /home/bosilca/.openmpi/rankfile file I have: rank 0=+n0 slot=0 rank 1=+n0 slot=1 rank 2=+n1 slot=0 rank 3=+n1 slot=1 As long as I spawn jobs with less than 4 processes everything worked fine. But when I used more than 4 processes, orterun segfaulted. After debugging I found that the nodes, lrank and nrank arrays were allocated based on the jdata->num_procs, but then filled based on the total number of processes in the jdata->nodes array. As it appears that the jdata->num_procs is somehow modified based on the number of entries in the rankfile, we end-up writing outside the allocation and then segfault. Now with the latest patch, we can cope with such a scenario by only packing the known information (and thus not writing outside the allocated arrays). This might not be the best approach, but it is doing what I'm looking for ... george. On Jul 15, 2009, at 15:50 , Ralph Castain wrote: The routed comm system relies on each daemon having complete information as to where every process is located, so the expectation was that only full maps would ever be sent. Thus, the nidmap code is setup to always send a full map. I don't know how to even generate a "partial" map. I assume you are doing something offline? Is this to update changed info? If so, you'll also have to do something to update the daemon's maps or the comm system will break down. Ralph On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca wrote: I have a question regarding the mapping. How can I declare a partial mapping ? In fact I only care about how some of the processes are mapped on some specific nodes. Right now if the rmaps doesn't contain information about all nodes, we give up (before this patch we segfaulted). Does it means we always have to declare the whole mapping or it's just that we overlooked this strange case? george. Begin forwarded message: Author: bosilca Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009) New Revision: 21686 URL: https://svn.open-mpi.org/trac/ompi/changeset/21686 Log: Reorder the nidmap encoding function. Add a check to make sure we don't write outside the boundaries of the allocated array. However, the problem is still there. If we have rmaps file containing only partial information the num_procs get set to the wrong value (the number of hosts in the rmaps file instead of the number of processes requested on the command line). ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686
Ah - interesting scenario! Definitely a "bug" in the code, then. What it looks like, though, is that the jdata->num_procs is wrong. There shouldn't be any way that the num_procs in the node array is different than jdata->num_procs. My guess is that the rank_file mapper isn't correctly maintaining the bookkeeping when we map the procs beyond those in the rankfile. I'll dig into it - have to fix something for Lenny anyway. Meantime, this change looks fine regardless as it (a) is better code and (b) protects us against such errors. Thanks for catching it! Ralph On Wed, Jul 15, 2009 at 2:30 PM, George Bosilca wrote: > I think I found a better solution (in r21688). Here is what I was trying to > do. > > I have a more or less homogeneous cluster. In fact all processors are > identical, except that some are quad core and some dual core. Of course I > care how my processes are mapped on the quad cores, but not really on the > dual cores. > > My approach was to use the following configuration files. > > In /home/bosilca/.openmpi/mca-params.conf I have: > > orte_default_hostfile=/home/bosilca/.openmpi/machinefile > rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile > rmaps_rank_file_priority = 100 > > In /home/bosilca/.openmpi/machinefile I have the full description of the > cluster. As an example: > node01 slots=4 > node02 slots=4 > node03 slots=2 > node04 slots=2 > > And in the /home/bosilca/.openmpi/rankfile file I have: > rank 0=+n0 slot=0 > rank 1=+n0 slot=1 > rank 2=+n1 slot=0 > rank 3=+n1 slot=1 > > As long as I spawn jobs with less than 4 processes everything worked fine. > But when I used more than 4 processes, orterun segfaulted. After debugging I > found that the nodes, lrank and nrank arrays were allocated based on the > jdata->num_procs, but then filled based on the total number of processes in > the jdata->nodes array. As it appears that the jdata->num_procs is somehow > modified based on the number of entries in the rankfile, we end-up writing > outside the allocation and then segfault. Now with the latest patch, we can > cope with such a scenario by only packing the known information (and thus > not writing outside the allocated arrays). > > This might not be the best approach, but it is doing what I'm looking for > ... > > george. > > > On Jul 15, 2009, at 15:50 , Ralph Castain wrote: > > The routed comm system relies on each daemon having complete information >> as to where every process is located, so the expectation was that only full >> maps would ever be sent. Thus, the nidmap code is setup to always send a >> full map. >> >> I don't know how to even generate a "partial" map. I assume you are doing >> something offline? Is this to update changed info? If so, you'll also have >> to do something to update the daemon's maps or the comm system will break >> down. >> >> Ralph >> >> On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca >> wrote: >> I have a question regarding the mapping. How can I declare a partial >> mapping ? In fact I only care about how some of the processes are mapped on >> some specific nodes. Right now if the rmaps doesn't contain information >> about all nodes, we give up (before this patch we segfaulted). >> >> Does it means we always have to declare the whole mapping or it's just >> that we overlooked this strange case? >> >> george. 
>> >> Begin forwarded message: >> >> >> Author: bosilca >> Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009) >> New Revision: 21686 >> URL: https://svn.open-mpi.org/trac/ompi/changeset/21686 >> >> Log: >> Reorder the nidmap encoding function. Add a check to make sure we don't >> write >> outside the boundaries of the allocated array. >> >> However, the problem is still there. If we have rmaps file containing only >> partial information the num_procs get set to the wrong value (the number >> of >> hosts in the rmaps file instead of the number of processes requested on >> the >> command line). >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686
Found the bug - we indeed failed to update the jdata->num_procs field when adding the non-rf-mapped procs to the job. Fix coming shortly. On Jul 15, 2009, at 2:40 PM, Ralph Castain wrote: Ah - interesting scenario! Definitely a "bug" in the code, then. What it looks like, though, is that the jdata->num_procs is wrong. There shouldn't be any way that the num_procs in the node array is different than jdata->num_procs. My guess is that the rank_file mapper isn't correctly maintaining the bookkeeping when we map the procs beyond those in the rankfile. I'll dig into it - have to fix something for Lenny anyway. Meantime, this change looks fine regardless as it (a) is better code and (b) protects us against such errors. Thanks for catching it! Ralph On Wed, Jul 15, 2009 at 2:30 PM, George Bosilca wrote: I think I found a better solution (in r21688). Here is what I was trying to do. I have a more or less homogeneous cluster. In fact all processors are identical, except that some are quad core and some dual core. Of course I care how my processes are mapped on the quad cores, but not really on the dual cores. My approach was to use the following configuration files. In /home/bosilca/.openmpi/mca-params.conf I have: orte_default_hostfile=/home/bosilca/.openmpi/machinefile rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile rmaps_rank_file_priority = 100 In /home/bosilca/.openmpi/machinefile I have the full description of the cluster. As an example: node01 slots=4 node02 slots=4 node03 slots=2 node04 slots=2 And in the /home/bosilca/.openmpi/rankfile file I have: rank 0=+n0 slot=0 rank 1=+n0 slot=1 rank 2=+n1 slot=0 rank 3=+n1 slot=1 As long as I spawn jobs with less than 4 processes everything worked fine. But when I used more than 4 processes, orterun segfaulted. After debugging I found that the nodes, lrank and nrank arrays were allocated based on the jdata->num_procs, but then filled based on the total number of processes in the jdata->nodes array. As it appears that the jdata->num_procs is somehow modified based on the number of entries in the rankfile, we end-up writing outside the allocation and then segfault. Now with the latest patch, we can cope with such a scenario by only packing the known information (and thus not writing outside the allocated arrays). This might not be the best approach, but it is doing what I'm looking for ... george. On Jul 15, 2009, at 15:50 , Ralph Castain wrote: The routed comm system relies on each daemon having complete information as to where every process is located, so the expectation was that only full maps would ever be sent. Thus, the nidmap code is setup to always send a full map. I don't know how to even generate a "partial" map. I assume you are doing something offline? Is this to update changed info? If so, you'll also have to do something to update the daemon's maps or the comm system will break down. Ralph On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca wrote: I have a question regarding the mapping. How can I declare a partial mapping ? In fact I only care about how some of the processes are mapped on some specific nodes. Right now if the rmaps doesn't contain information about all nodes, we give up (before this patch we segfaulted). Does it means we always have to declare the whole mapping or it's just that we overlooked this strange case? george. 
Begin forwarded message: Author: bosilca Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009) New Revision: 21686 URL: https://svn.open-mpi.org/trac/ompi/changeset/21686 Log: Reorder the nidmap encoding function. Add a check to make sure we don't write outside the boundaries of the allocated array. However, the problem is still there. If we have rmaps file containing only partial information the num_procs get set to the wrong value (the number of hosts in the rmaps file instead of the number of processes requested on the command line). ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] MPI_Accumulate() with MPI_PROC_NULL target rank
The MPI 2.1 standard says: "MPI_PROC_NULL is a valid target rank in the MPI RMA calls MPI_ACCUMULATE, MPI_GET, and MPI_PUT. The effect is the same as for MPI_PROC_NULL in MPI point-to-point communication. After any RMA operation with rank MPI_PROC_NULL, it is still necessary to finish the RMA epoch with the synchronization method that started the epoch." Unfortunately, MPI_Accumulate() is not quite the same as point-to-point, as a reduction is involved. Suppose you make this call (let me abuse and use keyword arguments): MPI_Accumulate(..., target_rank=MPI_PROC_NULL, target_datatype=MPI_BYTE, op=MPI_SUM, ...) IIUC, the call fails (with MPI_ERR_OP) in Open MPI because MPI_BYTE is an invalid datatype for MPI_SUM. But provided that the target rank is MPI_PROC_NULL, would it make sense for the call to succeed? -- Lisandro Dalcín --- Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC) Instituto de Desarrollo Tecnológico para la Industria Química (INTEC) Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) PTLC - Güemes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594
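A minimal reproducer of the call in question (a sketch; the window, buffer sizes and fence epoch are only for illustration). Each rank targets MPI_PROC_NULL with MPI_SUM on MPI_BYTE, so whether this returns MPI_SUCCESS or raises MPI_ERR_OP depends on how strictly the implementation checks arguments for a null target:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char origin = 1, target_buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Win_create(&target_buf, sizeof(target_buf), 1,
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        /* Target rank is MPI_PROC_NULL, so the operation should be a no-op;
         * the open question is whether (MPI_SUM, MPI_BYTE) is still rejected. */
        MPI_Accumulate(&origin, 1, MPI_BYTE,
                       MPI_PROC_NULL, 0, 1, MPI_BYTE,
                       MPI_SUM, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }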
Re: [OMPI devel] MPI_Accumulate() with MPI_PROC_NULL target rank
On Wed, 15 Jul 2009, Lisandro Dalcin wrote: The MPI 2-1 standard says: "MPI_PROC_NULL is a valid target rank in the MPI RMA calls MPI_ACCUMULATE, MPI_GET, and MPI_PUT. The effect is the same as for MPI_PROC_NULL in MPI point-to-point communication. After any RMA operation with rank MPI_PROC_NULL, it is still necessary to finish the RMA epoch with the synchronization method that started the epoch." Unfortunately, MPI_Accumulate() is not quite the same as point-to-point, as a reduction is involved. Suppose you make this call (let me abuse and use keyword arguments): MPI_Accumulate(..., target_rank=MPI_PROC_NULL, target_datatype=MPI_BYTE, op=MPI_SUM, ...) IIUC, the call fails (with MPI_ERR_OP) in Open MPI because MPI_BYTE is an invalid datatype for MPI_SUM. But provided that the target rank is MPI_PROC_NULL, would it make sense for the call to success? I believe no. We do full argument error checking (that you provided a valid communicator and datatype) on send, receive, put, and get when the source/dest is MPI_PROC_NULL. Therefore, I think it's logical that we extend that to include valid operations for accumulate. Brian
Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686
Okay, George - this is fixed in r21690. Thanks again Ralph On Jul 15, 2009, at 2:40 PM, Ralph Castain wrote: Ah - interesting scenario! Definitely a "bug" in the code, then. What it looks like, though, is that the jdata->num_procs is wrong. There shouldn't be any way that the num_procs in the node array is different than jdata->num_procs. My guess is that the rank_file mapper isn't correctly maintaining the bookkeeping when we map the procs beyond those in the rankfile. I'll dig into it - have to fix something for Lenny anyway. Meantime, this change looks fine regardless as it (a) is better code and (b) protects us against such errors. Thanks for catching it! Ralph On Wed, Jul 15, 2009 at 2:30 PM, George Bosilca wrote: I think I found a better solution (in r21688). Here is what I was trying to do. I have a more or less homogeneous cluster. In fact all processors are identical, except that some are quad core and some dual core. Of course I care how my processes are mapped on the quad cores, but not really on the dual cores. My approach was to use the following configuration files. In /home/bosilca/.openmpi/mca-params.conf I have: orte_default_hostfile=/home/bosilca/.openmpi/machinefile rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile rmaps_rank_file_priority = 100 In /home/bosilca/.openmpi/machinefile I have the full description of the cluster. As an example: node01 slots=4 node02 slots=4 node03 slots=2 node04 slots=2 And in the /home/bosilca/.openmpi/rankfile file I have: rank 0=+n0 slot=0 rank 1=+n0 slot=1 rank 2=+n1 slot=0 rank 3=+n1 slot=1 As long as I spawn jobs with less than 4 processes everything worked fine. But when I used more than 4 processes, orterun segfaulted. After debugging I found that the nodes, lrank and nrank arrays were allocated based on the jdata->num_procs, but then filled based on the total number of processes in the jdata->nodes array. As it appears that the jdata->num_procs is somehow modified based on the number of entries in the rankfile, we end-up writing outside the allocation and then segfault. Now with the latest patch, we can cope with such a scenario by only packing the known information (and thus not writing outside the allocated arrays). This might not be the best approach, but it is doing what I'm looking for ... george. On Jul 15, 2009, at 15:50 , Ralph Castain wrote: The routed comm system relies on each daemon having complete information as to where every process is located, so the expectation was that only full maps would ever be sent. Thus, the nidmap code is setup to always send a full map. I don't know how to even generate a "partial" map. I assume you are doing something offline? Is this to update changed info? If so, you'll also have to do something to update the daemon's maps or the comm system will break down. Ralph On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca wrote: I have a question regarding the mapping. How can I declare a partial mapping ? In fact I only care about how some of the processes are mapped on some specific nodes. Right now if the rmaps doesn't contain information about all nodes, we give up (before this patch we segfaulted). Does it means we always have to declare the whole mapping or it's just that we overlooked this strange case? george. Begin forwarded message: Author: bosilca Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009) New Revision: 21686 URL: https://svn.open-mpi.org/trac/ompi/changeset/21686 Log: Reorder the nidmap encoding function. 
Add a check to make sure we don't write outside the boundaries of the allocated array. However, the problem is still there. If we have rmaps file containing only partial information the num_procs get set to the wrong value (the number of hosts in the rmaps file instead of the number of processes requested on the command line). ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
- "Ralph Castain" wrote: > Could you check this? You can run a trivial job using the -npernode x > option, where x matched the #cores you were allocated on the nodes. > If you do this, do we bind to the correct cores? Nope, I'm afraid it doesn't - submitted a job asking for 4 cores on one node and was allocated cores 0-3 in the cpuset. Grep'ing the strace output for anything mentioning affinity shows: [csamuel@tango027 CPI]$ fgrep affinity cpi-trace.txt 11412 execve("/usr/local/openmpi/1.3.3-gcc/bin/mpiexec", ["mpiexec", "--mca", "paffinity", "linux", "-npernode", "4", "/home/csamuel/Sources/Tests/CPI/"...], [/* 56 vars */]) = 0 11412 sched_getaffinity(0, 128, { f }) = 8 11412 sched_setaffinity(0, 8, { 0 }) = -1 EFAULT (Bad address) 11416 sched_getaffinity(0, 128, 11416 <... sched_getaffinity resumed> { f }) = 8 11416 sched_setaffinity(0, 8, { 0 } 11416 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address) 11414 sched_getaffinity(0, 128, 11414 <... sched_getaffinity resumed> { f }) = 8 11414 sched_setaffinity(0, 8, { 0 } 11414 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address) 11413 sched_getaffinity(0, 128, 11413 <... sched_getaffinity resumed> { f }) = 8 11413 sched_setaffinity(0, 8, { 0 } 11413 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address) 11415 sched_getaffinity(0, 128, 11415 <... sched_getaffinity resumed> { f }) = 8 11415 sched_setaffinity(0, 8, { 0 } 11415 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address) 11413 sched_getaffinity(11413, 8, 11415 sched_getaffinity(11415, 8, 11413 <... sched_getaffinity resumed> { f }) = 8 11415 <... sched_getaffinity resumed> { f }) = 8 11414 sched_getaffinity(11414, 8, 11414 <... sched_getaffinity resumed> { f }) = 8 11416 sched_getaffinity(11416, 8, 11416 <... sched_getaffinity resumed> { f }) = 8 I can confirm that it's not worked by checking what plpa-taskset says about a process (for example 11414): [root@tango027 plpa-taskset]# ./plpa-taskset -cp 11414 pid 11414's current affinity list: 0-3 According to the manual page: EFAULT A supplied memory address was invalid. This is on a dual socket quad core AMD Shanghai system running the 2.6.28.9 kernel (not had a chance to upgrade recently). Will do some more poking around after lunch. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
Looking at your command line, did you remember to set -mca mpi_paffinity_alone 1? If not, we won't set affinity on the processes. On Jul 15, 2009, at 8:11 PM, Chris Samuel wrote: - "Ralph Castain" wrote: Could you check this? You can run a trivial job using the -npernode x option, where x matched the #cores you were allocated on the nodes. If you do this, do we bind to the correct cores? Nope, I'm afraid it doesn't - submitted a job asking for 4 cores on one node and was allocated cores 0-3 in the cpuset. Grep'ing the strace output for anything mentioning affinity shows: [csamuel@tango027 CPI]$ fgrep affinity cpi-trace.txt 11412 execve("/usr/local/openmpi/1.3.3-gcc/bin/mpiexec", ["mpiexec", "--mca", "paffinity", "linux", "-npernode", "4", "/home/csamuel/ Sources/Tests/CPI/"...], [/* 56 vars */]) = 0 11412 sched_getaffinity(0, 128, { f }) = 8 11412 sched_setaffinity(0, 8, { 0 }) = -1 EFAULT (Bad address) 11416 sched_getaffinity(0, 128, 11416 <... sched_getaffinity resumed> { f }) = 8 11416 sched_setaffinity(0, 8, { 0 } 11416 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address) 11414 sched_getaffinity(0, 128, 11414 <... sched_getaffinity resumed> { f }) = 8 11414 sched_setaffinity(0, 8, { 0 } 11414 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address) 11413 sched_getaffinity(0, 128, 11413 <... sched_getaffinity resumed> { f }) = 8 11413 sched_setaffinity(0, 8, { 0 } 11413 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address) 11415 sched_getaffinity(0, 128, 11415 <... sched_getaffinity resumed> { f }) = 8 11415 sched_setaffinity(0, 8, { 0 } 11415 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address) 11413 sched_getaffinity(11413, 8, 11415 sched_getaffinity(11415, 8, 11413 <... sched_getaffinity resumed> { f }) = 8 11415 <... sched_getaffinity resumed> { f }) = 8 11414 sched_getaffinity(11414, 8, 11414 <... sched_getaffinity resumed> { f }) = 8 11416 sched_getaffinity(11416, 8, 11416 <... sched_getaffinity resumed> { f }) = 8 I can confirm that it's not worked by checking what plpa-taskset says about a process (for example 11414): [root@tango027 plpa-taskset]# ./plpa-taskset -cp 11414 pid 11414's current affinity list: 0-3 According to the manual page: EFAULT A supplied memory address was invalid. This is on a dual socket quad core AMD Shanghai system running the 2.6.28.9 kernel (not had a chance to upgrade recently). Will do some more poking around after lunch. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel