[OMPI devel] how to add a component in the ompi?
Hi Jeff & All,

I want to add a new component to Open MPI.

1. I made a directory ~/mca/btl/ht.
2. I then worked out the basic functions I need to implement, such as: mca_btl_ht_add_procs, mca_btl_ht_del_procs, mca_btl_ht_alloc, mca_btl_ht_free, and mca_btl_ht_finalize.
3. Now I need to compile these functions, so I copied the Makefile under ~/mca/btl/tcp/, where I saw:

MCA_btl_ALL_COMPONENTS = self sm elan gm mx ofud openib portals tcp udapl
MCA_btl_ALL_SUBDIRS = mca/btl/self mca/btl/sm mca/btl/elan mca/btl/gm mca/btl/mx mca/btl/ofud mca/btl/openib mca/btl/portals mca/btl/tcp mca/btl/udapl

Adding my component to these two lines is not enough. Can you help me write a correct Makefile for my component under ~/mca/btl/ht?

Thanks & Regards, Yaohui Hu
Re: [OMPI devel] RFC: increase default AC/AM/LT requirements
Brian and I chatted about this in person -- he doesn't have too strong of an opinion here. We checked versions shipped in RHEL: RHEL4: AC 2.59, AM 1.9.2, LT 1.5.6 RHEL5: AC 2.59, AM 1.9.2, LT 1.5.22 Meaning: they're both really ancient. I personally don't mind forcing developers to have more modern versions because we're a fairly small group of people (vs. users). Does anyone else have an opinion here? On Feb 25, 2010, at 1:55 PM, Barrett, Brian W wrote: > I think our last set of minimums was based on being able to use RHEL4 out of > the box. Updating to whatever ships with RHEL5 probably makes sense, but I > think that still leaves you at a LT 1.5.x release. Being higher than that > requires new Autotools, which seems like asking for trouble. > > Brian > > On Feb 25, 2010, at 4:47 PM, Jeff Squyres wrote: > > > WHAT: Bump minimum required versions of GNU autotools up to modern > > versions. I suggest the following, but could be talked down a version or > > two: > > Autoconf: 2.65 > > Automake: 1.11.1 > > Libtool: 2.2.6b > > > > WHY: Stop carrying patches and workarounds for old versions. > > > > WHERE: autogen.sh, make_dist_tarball, various Makefile.am's, configure.ac, > > *.m4. > > > > WHEN: No real rush. Somewhere in 1.5.x. > > > > TIMEOUT: Friday March 5, 2010 > > > > > > > > I was debugging a complex Automake timestamp issue yesterday and discovered > > that it was caused by the fact that we are patching an old version of > > libtool.m4. It took a little while to figure out both the problem and an > > acceptable workaround. During this process, I noticed that autogen.sh > > still carries patches to fix bugs in some *really* old versions of Libtool > > (e.g., 1.5.22). Hence, I am send this RFC to increase the minimum required > > versions. > > > > Keep in mind: > > > > 1. This ONLY affects developers. Those who build from tarballs don't even > > need to have the Autotools installed. > > 2. Autotool patches should always be pushed upstream. We should only > > maintain patches for things that have been pushed upstream but have not yet > > been released. > > 3. We already have much more recent Autotools requirements for official > > distribution tarballs; see the chart here: > > > >http://www.open-mpi.org/svn/building.php > > > > Specifically: although official tarballs require recent Autotools, we allow > > developers to use much older versions. Why are we still carrying around > > this old kruft? Does some developer out there have a requirement to use > > older Autotools? > > > > If not, this RFC proposes to only allow recent versions of the Autotools to > > build Open MPI. I believe there's reasonable m4 these days that can make > > autogen/configure/whatever abort early if the versions are not new enough. > > This would allow us, at a minimum, to drop some of the libtool patches > > we're carrying. There may be some Makefile.am workarounds that are no > > longer necessary, too. > > > > There's no real rush on this; if this RFC passes, we can set a concrete, > > fixed date some point in the future where we switch over to requiring new > > versions. This should give everyone plenty of time to update if you need > > to, etc. > > > > -- > > Jeff Squyres > > jsquy...@cisco.com > > For corporate legal information go to: > > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > > ___ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > > -- > Brian W. Barrett > Dept. 
1423: Scalable System Software > Sandia National Laboratories > > > > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
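For reference, the kind of early version enforcement the RFC alludes to can be expressed with stock Autotools macros. The fragment below is only a sketch using the versions proposed above; it is not taken from Open MPI's actual configure.ac:

    dnl Sketch only: enforce the minimum Autotools versions proposed in this RFC.
    AC_PREREQ([2.65])        dnl Autoconf aborts if it is older than 2.65
    LT_PREREQ([2.2.6])       dnl libtool.m4 aborts if Libtool is older than 2.2.6
    AM_INIT_AUTOMAKE([1.11.1 foreign])
                             dnl the version in the first argument is treated as
                             dnl the minimum required Automake version

With checks like these in place, a developer running autogen.sh/configure with older tools fails immediately instead of hitting obscure errors later in the build.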
Re: [OMPI devel] RFC: Rename --enable-*-threads and ENABLE*THREAD* (take 2)
There was way too much information on this thread that I was looking for ;) And a lot of misunderstandings ...

If we want to allow ORTE to be on its own thread, then we should clearly banish the progress_thread from this equation. I would prefer ORTE to be as separated from the rest of the MPI library as possible, and therefore avoid most of the locks and overheads in MPI itself. Moving ORTE (as it only uses TCP sockets) onto its own poll is the best-looking approach, and this can be done easily once we upgrade our libevent to 2.0.

To be honest, the progress thread was not the smartest idea we had. It makes the code more complex, and the added benefit is quite small. Again, once we move to libevent 2 we will clean up the code and have a more consistent approach.

george.

On Mar 8, 2010, at 10:11 , Ralph Castain wrote: > > On Mar 8, 2010, at 6:43 AM, Jeff Squyres wrote: > >> On Mar 7, 2010, at 8:13 PM, Ralph Castain wrote: >> How about calling it --enable-opal-event-progress-thread, or even --enable-open-libevent-progress-thread? >>> >>> Why not add another 100+ characters to the name while we are at it? :-/ >>> >> >> :-) >> >> I didn't really think the length mattered here, since it's a configure >> argument. There has been a *lot* of confusion over the name of this >> particular switch over the past few years, so I'm suggesting that a longer, >> more descriptive name might be a little better. Just my $0.02... > > I honestly don't think that is the source of the confusion. The revised name > tells you exactly what that configure option does - it enables a thread at > the opal layer that calls opal_progress. Period. > > The confusion is over how that is used within the code, given that opal > doesn't have any communication system (as George pointed out). So having an > opal progress thread running will cause the event library to tick over, but > that does? It isn't directly tied to any existing subsystem, but rather > cuts across any of them that are sitting on sockets/file descriptors etc. > using the event library. > > If you look at the other progress threads in the system (e.g., openib), > you'll find that they don't use the event library to monitor their fd's - > they poll them directly. So enabling the opal progress thread doesn't > directly affect them. > > So I would say let's leave the name alone, and change it if/when someone > figures out how to utilize that capability. > >> The openib BTL can have up to 2 progress threads (!) -- the async verbs event notifier and the RDMA CM agent. They really should be consolidated. If there's infrastructure to consolidate them via opal or something else, then so much the better... >>> >>> Agreed, though I think that is best done as a separate effort from this RFC. >>> >> >> Agreed -- sorry, I wasn't clear. I wasn't trying to propose that that work >> be added to this RFC; I was just trying to mention that there could be a >> good use for the work from this RFC if such infrastructure was provided. >> >>> I believe there was a concern over latency if all the BTLs are driven by >>> one progress thread that sequentially runs across their respective file >>> descriptors, but I may be remembering it incorrectly... >>> >> >> >> FWIW, I believe the openib progress threads were written the way they >> were (i.e., without any opal progress thread support) because, at least in >> the current setup, to get the opal progress thread support, you have to turn >> on all the heavyweight locks, etc. 
These two progress threads now simply >> pthread_create() and do minimal locking between the main thread and >> themselves, without affecting the rest of the locking machinery in the code >> base. >> >> I'm not saying that's perfect (or even good); I'm just saying that that's >> the way it currently is. Indeed, at a minimum, Pasha and I have long talked >> about merging these two progress threads into 1. It would be even better if >> we could merge these two project threads into some other infrastructure. >> But it's always been somewhat of a low priority; we've never gotten a round >> tuit... >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] RFC: Rename --enable-*-threads and ENABLE*THREAD* (take 2)
No problem with me. Why don't we then modify this RFC to simply eliminate the --enable-opal-progress-thread option, and leave the rest as-is so we can enable the opal thread machinery without enabling MPI thread multiple? I agree that having an ORTE-level thread makes the most sense, if it is needed. On Mar 10, 2010, at 10:20 AM, George Bosilca wrote: > There was way too much information on this thread that I was looking for ;) > And a lot of misunderstandings ... > > If we want to allow ORTE to be on his own thread, then we should clearly > banish the progress_thread from this equation. I would prefer ORTE to be as > separated from the rest of the MPI library as possible, and therefore avoid > most of the locks and overheads on the MPI itself. Moving ORTE (as it only > use TCP sockets) on it's own poll is the best looking approach, and this can > be easily done once we upgrade out libevent to the 2.0. > > To be honest the progress thread was not the smartest idea we had. It makes > the code more complex, and the added benefit is quite small. Again, once we > move to the libevent-2 we will cleanup the code, and have a more consistent > approach. > > george. > > On Mar 8, 2010, at 10:11 , Ralph Castain wrote: > >> >> On Mar 8, 2010, at 6:43 AM, Jeff Squyres wrote: >> >>> On Mar 7, 2010, at 8:13 PM, Ralph Castain wrote: >>> > How about calling it --enable-opal-event-progress-thread, or even > --enable-open-libevent-progress-thread? Why not add another 100+ characters to the name while we are at it? :-/ >>> >>> :-) >>> >>> I didn't really think the length mattered here, since it's a configure >>> argument. There has been a *lot* of confusion over the name of this >>> particular switch over the past few years, so I'm suggesting that a longer, >>> more descriptive name might be a little better. Just my $0.02... >> >> I honestly don't think that is the source of the confusion. The revised name >> tells you exactly what that configure option does - it enables a thread at >> the opal layer that calls opal_progress. Period. >> >> The confusion is over how that is used within the code, given that opal >> doesn't have any communication system (as George pointed out). So having an >> opal progress thread running will cause the event library to tick over, but >> that does? It isn't directly tied to any existing subsystem, but rather >> cuts across any of them that are sitting on sockets/file descriptors etc. >> using the event library. >> >> If you look at the other progress threads in the system (e.g., openib), >> you'll find that they don't use the event library to monitor their fd's - >> they poll them directly. So enabling the opal progress thread doesn't >> directly affect them. >> >> So I would say let's leave the name alone, and change it if/when someone >> figures out how to utilize that capability. >> >>> > The openib BTL can have up to 2 progress threads (!) -- the async verbs > event notifier and the RDMA CM agent. They really should be > consolidated. If there's infrastructure to consolidate them via opal or > something else, then so much the better... Agreed, though I think that is best done as a separate effort from this RFC. >>> >>> Agreed -- sorry, I wasn't clear. I wasn't trying to propose that that work >>> be added to this RFC; I was just trying to mention that there could be a >>> good use for the work from this RFC if such infrastructure was provided. 
>>> I believe there was a concern over latency if all the BTLs are driven by one progress thread that sequentially runs across their respective file descriptors, but I may be remembering it incorrectly... >>> >>> >>> FWIW, I believe the openib progress threads were written the they way they >>> were (i.e., without any opal progress thread support) because, at least in >>> the current setup, to get the opal progress thread support, you have to >>> turn on all the heavyweight locks, etc. These two progress threads now >>> simply pthread_create() and do minimal locking between the main thread and >>> themselves, without affecting the rest of the locking machinery in the code >>> base. >>> >>> I'm not saying that's perfect (or even good); I'm just saying that that's >>> the way it currently is. Indeed, at a minimum, Pasha and I have long >>> talked about merging these two progress threads into 1. It would be even >>> better if we could merge these two project threads into some other >>> infrastructure. But it's always been somewhat of a low priority; we've >>> never gotten a round tuit... >>> >>> -- >>> Jeff Squyres >>> jsquy...@cisco.com >>> For corporate legal information go to: >>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/
Re: [OMPI devel] RFC: increase default AC/AM/LT requirements
Hey Jeff, as said I'd be in favor of updating. This would get rid of some of the tests in autogen, speeding up things. Regards, Rainer On Wednesday 10 March 2010 12:07:56 pm Jeff Squyres wrote: > Brian and I chatted about this in person -- he doesn't have too strong of > an opinion here. > > We checked versions shipped in RHEL: > > RHEL4: AC 2.59, AM 1.9.2, LT 1.5.6 > RHEL5: AC 2.59, AM 1.9.2, LT 1.5.22 > > Meaning: they're both really ancient. > > I personally don't mind forcing developers to have more modern versions > because we're a fairly small group of people (vs. users). Does anyone > else have an opinion here? > > On Feb 25, 2010, at 1:55 PM, Barrett, Brian W wrote: > > I think our last set of minimums was based on being able to use RHEL4 out > > of the box. Updating to whatever ships with RHEL5 probably makes sense, > > but I think that still leaves you at a LT 1.5.x release. Being higher > > than that requires new Autotools, which seems like asking for trouble. > > > > Brian > > > > On Feb 25, 2010, at 4:47 PM, Jeff Squyres wrote: > > > WHAT: Bump minimum required versions of GNU autotools up to modern > > > versions. I suggest the following, but could be talked down a version > > > or two: Autoconf: 2.65 > > > Automake: 1.11.1 > > > Libtool: 2.2.6b > > > > > > WHY: Stop carrying patches and workarounds for old versions. > > > > > > WHERE: autogen.sh, make_dist_tarball, various Makefile.am's, > > > configure.ac, *.m4. > > > > > > WHEN: No real rush. Somewhere in 1.5.x. > > > > > > TIMEOUT: Friday March 5, 2010 > > > > > > > > > > > > I was debugging a complex Automake timestamp issue yesterday and > > > discovered that it was caused by the fact that we are patching an old > > > version of libtool.m4. It took a little while to figure out both the > > > problem and an acceptable workaround. During this process, I noticed > > > that autogen.sh still carries patches to fix bugs in some *really* old > > > versions of Libtool (e.g., 1.5.22). Hence, I am send this RFC to > > > increase the minimum required versions. > > > > > > Keep in mind: > > > > > > 1. This ONLY affects developers. Those who build from tarballs don't > > > even need to have the Autotools installed. 2. Autotool patches should > > > always be pushed upstream. We should only maintain patches for things > > > that have been pushed upstream but have not yet been released. 3. We > > > already have much more recent Autotools requirements for official > > > distribution tarballs; see the chart here: > > > > > >http://www.open-mpi.org/svn/building.php > > > > > > Specifically: although official tarballs require recent Autotools, we > > > allow developers to use much older versions. Why are we still > > > carrying around this old kruft? Does some developer out there have a > > > requirement to use older Autotools? > > > > > > If not, this RFC proposes to only allow recent versions of the > > > Autotools to build Open MPI. I believe there's reasonable m4 these > > > days that can make autogen/configure/whatever abort early if the > > > versions are not new enough. This would allow us, at a minimum, to > > > drop some of the libtool patches we're carrying. There may be some > > > Makefile.am workarounds that are no longer necessary, too. > > > > > > There's no real rush on this; if this RFC passes, we can set a > > > concrete, fixed date some point in the future where we switch over to > > > requiring new versions. This should give everyone plenty of time to > > > update if you need to, etc. 
> > > > > > -- > > > Jeff Squyres > > > jsquy...@cisco.com > > > For corporate legal information go to: > > > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > > > > > ___ > > > devel mailing list > > > de...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > > -- > > Brian W. Barrett > > Dept. 1423: Scalable System Software > > Sandia National Laboratories > > > > > > > > > > > > ___ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Rainer Keller, PhD Tel: +1 (865) 241-6293 Oak Ridge National Lab Fax: +1 (865) 241-4811 PO Box 2008 MS 6164 Email: kel...@ornl.gov Oak Ridge, TN 37831-2008AIM/Skype: rusraink
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Yesterday evening George, Thomas, and I discussed some of their concerns about this RFC at the MPI Forum meeting. After the discussion, we seemed to be in agreement that the RecoS framework is a good idea and that the concepts and fixes in this RFC should move forward, with a couple of notes:

- They wanted to test the branch a bit more over the next couple of days. Some MCA parameters that you will need are at the bottom of this message.

- Reiterate that this RFC only addresses ORTE stability, not OMPI stability. The OMPI stability extension is a second step in this line of work, and should/will fit in nicely with the RecoS framework being proposed in this RFC. The OMPI layer stability will require a significant amount of work, but the RecoS framework will provide the ORTE layer stability that is required as a foundation for OMPI layer stability in the future.

- The purpose of the ErrMgr becomes slightly unclear with the addition of the RecoS framework, since both are focused on responding to faults in the system (and RecoS, when enabled, overrides most/all of the ErrMgr functionality). Should the RecoS framework be merged with the ErrMgr framework to create a new ErrMgr interface?

We are trying to decide if we should merge these frameworks, but at this point we are interested in hearing how other developers feel about merging the ErrMgr and RecoS frameworks, which would change the ErrMgr API. Are there any developers out there who are developing ErrMgr components, or who are using any particular features of the existing ErrMgr framework that they would like to see preserved in the next revision? By default, the existing abort behavior of the ErrMgr framework will be preserved, so the user will have to 'opt in' to any fault recovery capabilities.

So we are continuing the discussion a bit more off-list, and will return to the list with an updated RFC (and possibly a new branch) soon (hopefully end of the week/early next week). I would like to briefly discuss this RFC at the Open MPI teleconf next Tuesday.

-- Josh

On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote: > Sounds good to me. > > For those casually following this RFC let me summarize its current state. > > Josh and George (and anyone else that wishes to participate attending the > forum) will meet sometime at the next MPI Forum meeting (March 8-10). I will > post any relevant notes from this meeting back to the list afterwards. So the > RFC is on hold pending the outcome of that meeting. For those developers > interested in this RFC that will not be able to attend, feel free to continue > using this thread for discussion. > > Thanks, > Josh > > On Feb 26, 2010, at 6:09 AM, George Bosilca wrote: > >> >> On Feb 26, 2010, at 01:50 , Josh Hursey wrote: >> >>> Any of those options are fine with me. I was thinking that if you wanted to >>> talk sooner, we might be able to help explain our intentions with this >>> framework a bit better. I figure that the framework interface will change a >>> bit as we all advance and incorporate our various techniques into it. I >>> think that the current interface is a good first step, but there are >>> certainly many more steps to come. >>> >>> I am fine delaying this code a bit, just not too long. Meeting at the forum >>> for a while might be a good option (we could probably even arrange to call >>> in others if you wanted). >> >> Sounds good, let do this. >> >> Thanks, >> george. 
>> >>> >>> Cheers, >>> Josh >>> >>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote: >>> If Josh is going to be at the forum, perhaps you folks could chat there? Might as well take advantage of being colocated, if possible. Otherwise, I'm available pretty much any time. I can't contribute much about the MPI recovery issues, but can contribute to the RTE issues if that helps. On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca wrote: Josh, Next week is a little bit too early as will need some time to figure out how to integrate with this new framework, and at what extent our code and requirements fit into. Then the week after is the MPI Forum. How about on Thursday 11 March? Thanks, george. On Feb 25, 2010, at 12:46 , Josh Hursey wrote: > Per my previous suggestion, would it be useful to chat on the phone early > next week about our various strategies? ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> __
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Josh, You mentioned some MCA parameters that you would include in the email, but I don't see those parameters anywhere. Could you please put those in here to make testing easier for people. Wesley On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey wrote: > Yesterday evening George, Thomas and I discussed some of their concerns > about this RFC at the MPI Forum meeting. After the discussion, we seemed to > be in agreement that the RecoS framework is a good idea and the concepts and > fixes in this RFC should move forward with a couple of notes: > > - They wanted to test the branch a bit more over the next couple of days. > Some MCA parameters that you will need are at the bottom of this message. > > - Reiterate that this RFC only addresses ORTE stability, not OMPI > stability. The OMPI stability extension is a second step for the line of > work, and should/will fit in nicely with the RecoS framework being proposed > in this RFC. The OMPI layer stability will require a significant amount of > work, but the RecoS framework will provide the ORTE layer stability that is > required as a foundation for OMPI layer stability in the future. > > - The purpose of the ErrMgr becomes slightly unclear with the addition of > the RecoS framework, since both are focused on responding to faults in the > system (and RecoS, when enabled, overrides most/all of the ErrMgr > functionality). Should the RecoS framework be merged with the ErrMgr > framework to create a new ErrMgr interface? > > We are typing to decide if we should merge these frameworks, but at this > point we are interested in hearing how other developers feel about merging > the ErrMgr and RecoS frameworks, which would change the ErrMgr API. Are > there any developers out there that are developing ErrMgr components, or are > using any particular features of the existing ErrMgr framework that they > would like to see preserved in the next revision. By default, the existing > default abort behavior of the ErrMgr framework will be preserved, so the > user will have to 'opt-in' to any fault recovery capabilities. > > So we are continuing the discussion a bit more off-list, and will return to > the list with an updated RFC (and possibly a new branch) soon (hopefully end > of the week/early next week). I would like to briefly discuss this RFC at > the Open MPI teleconf next Tuesday. > > -- Josh > > On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote: > > > Sounds good to me. > > > > For those casually following this RFC let me summarize its current state. > > > > Josh and George (and anyone else that wishes to participate attending the > forum) will meet sometime at the next MPI Forum meeting (March 8-10). I will > post any relevant notes from this meeting back to the list afterwards. So > the RFC is on hold pending the outcome of that meeting. For those developers > interested in this RFC that will not be able to attend, feel free to > continue using this thread for discussion. > > > > Thanks, > > Josh > > > > On Feb 26, 2010, at 6:09 AM, George Bosilca wrote: > > > >> > >> On Feb 26, 2010, at 01:50 , Josh Hursey wrote: > >> > >>> Any of those options are fine with me. I was thinking that if you > wanted to talk sooner, we might be able to help explain our intentions with > this framework a bit better. I figure that the framework interface will > change a bit as we all advance and incorporate our various techniques into > it. I think that the current interface is a good first step, but there are > certainly many more steps to come. 
> >>> > >>> I am fine delaying this code a bit, just not too long. Meeting at the > forum for a while might be a good option (we could probably even arrange to > call in others if you wanted). > >> > >> Sounds good, let do this. > >> > >> Thanks, > >> george. > >> > >>> > >>> Cheers, > >>> Josh > >>> > >>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote: > >>> > If Josh is going to be at the forum, perhaps you folks could chat > there? Might as well take advantage of being colocated, if possible. > > Otherwise, I'm available pretty much any time. I can't contribute much > about the MPI recovery issues, but can contribute to the RTE issues if that > helps. > > > On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca > wrote: > Josh, > > Next week is a little bit too early as will need some time to figure > out how to integrate with this new framework, and at what extent our code > and requirements fit into. Then the week after is the MPI Forum. How about > on Thursday 11 March? > > Thanks, > george. > > On Feb 25, 2010, at 12:46 , Josh Hursey wrote: > > > Per my previous suggestion, would it be useful to chat on the phone > early next week about our various strategies? > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > ___
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Wesley,

Thanks for catching that oversight. Below are the MCA parameters that you should need at the moment:

#
# Use the C/R Process Migration Recovery Supervisor
recos_base_enable=1
# Only use the 'rsh' launcher, other launchers will be supported later
plm=rsh
# The resilient mapper knows how to use RecoS and deal with recovering procs
rmaps=resilient
# 'cm' component is the only one that can handle failures at the moment
routed=cm
#

Let me know if you have any troubles.

-- Josh

On Mar 10, 2010, at 10:36 AM, Wesley Bland wrote: > Josh, > > You mentioned some MCA parameters that you would include in the email, but I > don't see those parameters anywhere. Could you please put those in here to > make testing easier for people. > > Wesley > > On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey wrote: > Yesterday evening George, Thomas and I discussed some of their concerns about > this RFC at the MPI Forum meeting. After the discussion, we seemed to be in > agreement that the RecoS framework is a good idea and the concepts and fixes > in this RFC should move forward with a couple of notes: > > - They wanted to test the branch a bit more over the next couple of days. > Some MCA parameters that you will need are at the bottom of this message. > > - Reiterate that this RFC only addresses ORTE stability, not OMPI stability. > The OMPI stability extension is a second step for the line of work, and > should/will fit in nicely with the RecoS framework being proposed in this > RFC. The OMPI layer stability will require a significant amount of work, but > the RecoS framework will provide the ORTE layer stability that is required as > a foundation for OMPI layer stability in the future. > > - The purpose of the ErrMgr becomes slightly unclear with the addition of > the RecoS framework, since both are focused on responding to faults in the > system (and RecoS, when enabled, overrides most/all of the ErrMgr > functionality). Should the RecoS framework be merged with the ErrMgr > framework to create a new ErrMgr interface? > > We are trying to decide if we should merge these frameworks, but at this > point we are interested in hearing how other developers feel about merging > the ErrMgr and RecoS frameworks, which would change the ErrMgr API. Are there > any developers out there that are developing ErrMgr components, or are using > any particular features of the existing ErrMgr framework that they would like > to see preserved in the next revision. By default, the existing default abort > behavior of the ErrMgr framework will be preserved, so the user will have to > 'opt-in' to any fault recovery capabilities. > > So we are continuing the discussion a bit more off-list, and will return to > the list with an updated RFC (and possibly a new branch) soon (hopefully end > of the week/early next week). I would like to briefly discuss this RFC at the > Open MPI teleconf next Tuesday. > > -- Josh > > On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote: > > > Sounds good to me. > > > > For those casually following this RFC let me summarize its current state. > > > > Josh and George (and anyone else that wishes to participate attending the > forum) will meet sometime at the next MPI Forum meeting (March 8-10). I will > post any relevant notes from this meeting back to the list afterwards. So > the RFC is on hold pending the outcome of that meeting. For those developers > interested in this RFC that will not be able to attend, feel free to > continue using this thread for discussion. 
> > > > Thanks, > > Josh > > > > On Feb 26, 2010, at 6:09 AM, George Bosilca wrote: > > > >> > >> On Feb 26, 2010, at 01:50 , Josh Hursey wrote: > >> > >>> Any of those options are fine with me. I was thinking that if you wanted > >>> to talk sooner, we might be able to help explain our intentions with this > >>> framework a bit better. I figure that the framework interface will change > >>> a bit as we all advance and incorporate our various techniques into it. I > >>> think that the current interface is a good first step, but there are > >>> certainly many more steps to come. > >>> > >>> I am fine delaying this code a bit, just not too long. Meeting at the > >>> forum for a while might be a good option (we could probably even arrange > >>> to call in others if you wanted). > >> > >> Sounds good, let do this. > >> > >> Thanks, > >> george. > >> > >>> > >>> Cheers, > >>> Josh > >>> > >>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote: > >>> > If Josh is going to be at the forum, perhaps you folks could chat there? > Might as well take advantage of being colocated, if possible. > > Otherwise, I'm available pretty much any time. I can't contribute much > about the MPI recovery issues, but can contribute to the RTE issues if > that helps. > > > On Thu, Feb 25, 2010 at 7:39 PM,
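For anyone who wants to try the branch, a minimal sketch of how those parameters could be passed on the command line follows. The application name ./my_app and the process count are placeholders; the parameter names are the ones Josh listed above:

    # Pass the MCA parameters on the command line for a single run:
    mpirun -np 4 \
        -mca recos_base_enable 1 \
        -mca plm rsh \
        -mca rmaps resilient \
        -mca routed cm \
        ./my_app

    # Or put the same key=value lines from Josh's message into
    # $HOME/.openmpi/mca-params.conf so that every run picks them up.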
Re: [OMPI devel] how to add a component in the ompi?
Once you add a directory under ompi/mca/btl/ and add the relevant files, then the next time you run "autogen.sh", it should just "find" the component and add it to the configure and build process. You should not need to edit ompi/mca/btl/Makefile.am yourself. Have a look at these wiki pages; they explain point-by-point how to add a component into OMPI's source tree: https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateFramework https://svn.open-mpi.org/trac/ompi/wiki/devel/Autogen On Mar 10, 2010, at 8:53 AM, hu yaohui wrote: > Hi Jeff & All > i want to add a new component in the ompi, > 1: i make a dir ~/mca/btl/ht > 2:Then,i have made sure some basic functions i need to implement. > such as: > mca_btl_ht_add_procs, > mca_btl_ht_del_procs, > mca_btl_ht_alloc > mca_btl_ht_free > mca_btl_ht_finalize. > 3:after these functions,i must compile these funcitons,i copied the makefire > under ~/mca/btl/tcp/, > i have seen these: > > MCA_btl_ALL_COMPONENTS = self sm elan gm mx ofud openib portals tcp udapl > MCA_btl_ALL_SUBDIRS = mca/btl/self mca/btl/sm mca/btl/elan mca/btl/gm > mca/btl/mx mca/btl/o > fud mca/btl/openib mca/btl/portals mca/btl/tcp mca/btl/udapl > > add my component into these two lines is just not enough. > Can you help me out on making a right Makefile for my component under folder > ~/mca/blt/ht? > > Thanks & Regards, > Yaohui Hu > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
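For reference, a minimal Makefile.am for the hypothetical btl/ht component, modeled loosely on the tcp component, might look like the sketch below. The source file list and the DSO conditional name are assumptions; copy the exact pattern from mca/btl/tcp/Makefile.am in the tree you are working on, and note that depending on the tree a small configure.m4 or configure.params file may also be needed (the wiki pages above cover this):

    # ompi/mca/btl/ht/Makefile.am -- hypothetical sketch, not an actual file

    sources = \
            btl_ht.c \
            btl_ht.h \
            btl_ht_component.c

    # Build either a loadable DSO (mca_btl_ht.la) or a convenience library that
    # gets linked into libmpi (libmca_btl_ht.la), depending on how configure ran.
    # The conditional name below is an assumption; check btl/tcp for the real one.
    if OMPI_BUILD_btl_ht_DSO
    component_noinst =
    component_install = mca_btl_ht.la
    else
    component_noinst = libmca_btl_ht.la
    component_install =
    endif

    mcacomponentdir = $(pkglibdir)
    mcacomponent_LTLIBRARIES = $(component_install)
    mca_btl_ht_la_SOURCES = $(sources)
    mca_btl_ht_la_LDFLAGS = -module -avoid-version

    noinst_LTLIBRARIES = $(component_noinst)
    libmca_btl_ht_la_SOURCES = $(sources)
    libmca_btl_ht_la_LDFLAGS = -module -avoid-version

After adding the files, re-run autogen.sh and configure from the top-level directory so the new component is picked up by the build.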