Re: [OMPI devel] Jan ORTE meeting
Ralph Castain wrote:
> On Dec 4, 2008, at 3:25 PM, Jeff Squyres wrote:
> > I don't know who's interested, so I thought I'd bring it up on the devel list: let's start on the basics for the January ORTE meeting. We may be able to sketch out an agenda, but frankly, it may depend on how far we get in the December meeting. So we may not be able to fully decide that yet. But I thought we might be able to list out the high level goals and start discussing location, length of the meeting, and dates.
> >
> > * Cisco has some heavy travel restrictions right now, so I would not be able to attend unless we have it here in Louisville again, or perhaps Bloomington (i.e., I can make day trips to drive there). I do have a Telepresence unit in Louisville now, so we could conference in Boxborough (i.e., Sun) for 1-2 hours at a time, if desirable. That being said, my presence is probably not critical to these meetings, so don't let this be a gating factor.
> > * My January is fairly open at this point; I have no scheduled travel [yet]. I *may* become unavailable the week of Jan 5 (i.e., first full week in Jan), which might not be desirable for an ORTE meeting anyway because a) we'll all be recovering from the holidays, and b) it's unlikely that anyone will have had a chance to do much/anything since the December ORTE meeting.
> > * Point of information: the next Forum meeting is in the California Bay area on Feb 9-11, 2009.
>
> My Jan is open, and I can travel wherever required. However, given that many people are under heavy travel constraints, and that the majority of you will be at the MPI Forum, would it make sense to tack it on before/after that meeting? I'm not sure Jan is a requirement, especially given the holiday break after the Dec meeting.

Just wanted to add that Sun is interested in these meetings too, but we are also under strict travel restrictions, so doing this via Cisco's Telepresence, a concall, or at the Forum meeting suits us better than a special meeting.
--td
-- Jeff Squyres Cisco Systems
___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] BTL move - the notion
Richard Graham wrote:
> Let me start the e-mail conversation, and see how far we get.
>
> Goal: The goal several of us have is to be able to use the btl’s outside of the MPI layer in Open MPI. The layer itself is generic, w/o specific knowledge of Upper Level Protocols, so is well suited for this sort of use.
>
> Technical Approach: What we have suggested is to start the process with the Open MPI code base, and make it independent of the mpi-layer (which it is now), and the run-time layer.
>
> Before we get into any specific technical details, the first question I have is are people totally opposed to the notion of making the btl’s independent of MPI and the run-time ? This does not mean that it can’t be used by it, but that there are well defined abstraction layers, i.e., are people against the goal in the first place ?

I am not against the idea of separating the BTLs out from OMPI. However, it would help to know what we are really trying to accomplish with this and why using MPI is a non-starter. Is the issue that MPI is too heavyweight a protocol, or is it the infrastructure? I guess one question I have is: if we separate the BTLs from OMPI, do we lose valuable information needed to establish and maintain the connections, and could we run into some chicken/egg problems? I assume the last issue is only of concern if we remove the orte/opal dependencies.

> What are alternative suggestions to the technical approach ?

The technical approach is really the implementation logistics, right? That is, how do we apply these changes to the trunk such that they get in asap, so as not to require significant ongoing maintenance by the implementors and not disturb the community members who are doing other work.

Branch and patch - protects the community members the most until it comes to the flag day of pushing the patch. But as you mention below, this has a heavy cost on the implementors and eventually a potentially large blackout period.

Incremental approach - If we believe this project will be large, I suggest we try to map out all the different pieces and figure out ways we can compartmentalize each piece such that they can be putback separately from each other. This is similar to the branch and patch approach, except we try to do several patches that can each be reasonably tested and putback separately from the others. The hope is that each patch is not that large and thus easier for the implementors to maintain and merge. But this will require a more thought out plan as to how things are done, which might be detrimental to any agile development.

Scorched earth - Map out a calendar time frame, say from X to Y, during which the trunk will be under major renovation to move the BTLs out of OMPI. This helps the BTL movement developers but could put any other development at risk. It also commits us completely to doing the BTL separation, so if things start falling apart it will definitely delay the next release.

I personally prefer the Incremental Approach, but we will need to have a very well thought out plan to get this to work. This approach could devolve into the other two approaches without careful planning, which I don't believe anyone would really like to see.

HTH,
--td

> One suggestion has been to branch and patch. To me this is a long-term maintenance nightmare. What are people's thoughts here ?
>
> Rich
Re: [OMPI devel] Jan ORTE meeting
Since Rich was already unable to attend the July meeting, I would like to find a way to accommodate his schedule if possible. I'm not sure that it is critical that we have a meeting too hard on the heels of the Dec one, so perhaps something in Feb (or even March) makes the most sense.

Given travel restrictions, perhaps a telepresence approach makes the most sense. I'm willing and able to travel within reason, so the Forum or midwest is fine with me. I sympathize with Rich's travel situation wrt the Forum dates, but I don't know how to provide the telepresence to ORNL. Jeff: do you?

Perhaps we should discuss this at the Tues telecon? Might go faster there. Anyone who isn't going to be on that telecon should please speak up on this email thread ASAP.

Ralph

On Dec 5, 2008, at 4:12 AM, Terry Dontje wrote:
> Just wanted to add that Sun is interested in these meetings too, but we are also under strict travel restrictions, so doing this via Cisco's Telepresence, a concall, or at the Forum meeting suits us better than a special meeting.
>
> --td
Re: [OMPI devel] BTL move - the notion
I'll answer this outside of Terry's reply so we can stay under George's page limit. :-))

I don't have any philosophical opposition to the idea. Indeed, there are places where I would potentially have some use for the btl's, perhaps as an alternative comm channel in the OOB.

I will point out, though, that there are several things we thought when we started this project that have proven unworkable over time. For example, the idea that the RTE could be a general purpose one without impacting OMPI proved incorrect and has been abandoned. It may well be that the notion of using the BTL's for non-OMPI projects will fall into that category as well - not saying it does, but I think it is still TBD.

That said, I do have some significant concerns about -how- this is done that fall into two categories:

1. Procedural

Keeping the common code in the OMPI repository can raise quite a bit of trouble with synchronizing release cycles. We are just about to exit a period of requested "quiet" time on the trunk to stabilize it for the 1.3 release. If STCI had been in an active development phase, this could have caused a major problem, as we would have demanded they not commit to our code repository. It is easy to foresee the reverse situation. Indeed, from working on several other similar projects, this problem is not only common, but frequent. How do we intend to work this out?

I am also concerned about slowing down OMPI's development efforts due to the need to coordinate proposed changes with an even broader community, and one that will have conflicting requirements/schedules. We already have problems getting people to stay adequately involved as changes are proposed and made, especially as the community's members have become involved in other efforts over time. It would become unworkable if we take months to touch base with everyone who might be impacted and get general consensus on changes required by OMPI. As Terry said, we have to maintain OMPI's agility.
We all need to keep something in mind here. While this discussion is about the BTL's and coordinating with STCI, we are talking about a general method of operation that will have to be extended to anyone with a similar request. There already are other groups out there, some competing with STCI, that have issued similar requests for sharing various pieces of the code base (the ones coming to me mostly pertain to the RTE). So whatever we do should be generalizable - it can't just be a point solution for STCI.

I am disturbed by the immediate rejection of methods developed and used by other large code projects that address this very problem. Both Hg and Git were developed specifically with this code sharing synchronization issue in mind, and have enjoyed rapid adoption and get rave reviews for their solutions. This approach provides maximum flexibility, but requires a bit of a learning curve and admittedly more attention to maintenance details. However, other projects in similar circumstances have found it highly beneficial. I would think we should at least consider what is becoming the state-of-the-art method for code sharing before simply rejecting this approach as too much maintenance.

2. Technical

I think we all agree that STCI and OMPI have different objectives and requirements. OMPI is facing the need to launch and operate at extreme scales by next summer, has received a lot of interest in having it report errors into various systems, etc. We don't have all the answers as to what will be necessary to meet these requirements, but indications so far are that tighter integration, not deeper abstraction, between the various layers will be needed. By that, I don't mean we will violate abstraction layers, but rather that the various layers need to work more as a tightly tuned instrument, with each layer operating based on a clear knowledge of how the other layers are functioning.
For example, for modex-less operations, the MPI/BTLs have to know that the RTE/OS will be providing certain information. This means that they don't have to go out and discover it themselves every time. Yes, we will leave that as the default behavior so that small and/or unmanaged clusters can operate, but we have to also introduce logic that can detect when we are utilizing this alternative capability and exploit it. While we are trying our best to avoid introducing RTE-like calls into the code, the fact is that we may well have to do so (we have already identified one btl that will definitely need to). It is simply too early to make the decision to cut that off now - we don't know what the long-term impacts of such a decision will be. Finally, although I don't do much on the MPI layer, I am concerned about performance. I would tend to oppose any additional abstraction until we can measure the performance impact. Thus, I would like to see the BTL move done on a tmp bran
Re: [OMPI devel] Jan ORTE meeting
On Dec 5, 2008, at 8:05 AM, Ralph Castain wrote:
> Since Rich was already unable to attend the July meeting, I would like to find a way to accommodate his schedule if possible. I'm not sure that it is critical that we have a meeting too hard on the heels of the Dec one, so perhaps something in Feb (or even March) makes the most sense. Given travel restrictions, perhaps a telepresence approach makes the most sense. I'm willing and able to travel within reason, so the Forum or midwest is fine with me. I sympathize with Rich's travel situation wrt the Forum dates, but I don't know how to provide the telepresence to ORNL. Jeff: do you?

I'm checking on whether there are Telepresence rooms in Tennessee. FWIW, I don't think there are any in ABQ yet.

Also, keep in mind that I can usually only get mainstream Cisco Telepresence rooms for 1-2 hours at a time. Boxborough is a "mainstream" Cisco office (read: a big office with lots of people who use Telepresence). Louisville is small and relatively easy to schedule Telepresence time.

> Perhaps we should discuss this at the Tues telecon? Might go faster there. Anyone who isn't going to be on that telecon should please speak up on this email thread ASAP.

That sounds good. Let's talk next Tuesday - right after we release v1.3!! :-)

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] BTL move - the notion
On 12/5/08 6:49 AM, "Terry D. Dontje" wrote:

> Richard Graham wrote:
> > Let me start the e-mail conversation, and see how far we get.
> >
> > Goal: The goal several of us have is to be able to use the btl’s outside of the MPI layer in Open MPI. The layer itself is generic, w/o specific knowledge of Upper Level Protocols, so is well suited for this sort of use.
> >
> > Technical Approach: What we have suggested is to start the process with the Open MPI code base, and make it independent of the mpi-layer (which it is now), and the run-time layer.
> >
> > Before we get into any specific technical details, the first question I have is are people totally opposed to the notion of making the btl’s independent of MPI and the run-time ? This does not mean that it can’t be used by it, but that there are well defined abstraction layers, i.e., are people against the goal in the first place ?
>
> I am not against the idea of separating the BTLs out from OMPI. However, it would help to know what we are really trying to accomplish with this and why using MPI is a non-starter. Is the issue that MPI is too heavyweight a protocol, or is it the infrastructure? I guess one question I have is: if we separate the BTLs from OMPI, do we lose valuable information needed to establish and maintain the connections, and could we run into some chicken/egg problems? I assume the last issue is only of concern if we remove the orte/opal dependencies.

Not quite sure about the MPI question. The btl's are ULP neutral communications primitives (by design), and we want to re-use these outside MPI: in the run-time (actually for FT in MPI), and in other ULP's. So OPAL dependencies will be maintained, as these are what give us the portability layer. What needs to be a bit more generic is how these are used by ULP's, and specifically issues revolving around indexing. I am guessing that these are issues that will come up when addressing how to use other run-times in the context of OMPI.

> > What are alternative suggestions to the technical approach ?
>
> The technical approach is really the implementation logistics, right? That is, how do we apply these changes to the trunk such that they get in asap, so as not to require significant ongoing maintenance by the implementors and not disturb the community members who are doing other work.

Yes. First, I am advocating a phased approach, to minimize disruption to the trunk.

The first phase is renaming structures, and moving them in the code tree.

The second is moving the btl and supporting code (mpools, rcache, allocator, ?; I have already gotten feedback that we should consider moving the bml as well, which is very reasonable) to a new location in the code tree. These, I expect, should touch a lot of code, but it either compiles or it does not. No data structure changes or any other such changes will be made at this stage.

The final phase is removing any dependencies on other layers. At this stage all I can think of is the notifier, but I am not doing the work, so there could be other changes. Here we need to talk as a community on how to best do this. It is clear that we need the notifier in this layer, and maybe we use an approach that Ralph has suggested and use #defines.

At this stage I do foresee the need to make a change to the btl's, for general use - we need to add attributes that tell us if a given btl can bootstrap itself, and if forked processes can also use this btl in the children.

The larger changes I was concerned about I think have more to do with enabling other run-time support within the ompi code base, and these will be addressed in a separate track, as Jeff has suggested. This is where I expect larger changes within ompi, but this has more to do with ompi than with others being able to use the btl's.
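As an illustration of the two attributes mentioned above (whether a btl can bootstrap itself, and whether forked children can use it), such properties could be expressed as capability bits that an upper-level protocol checks before picking a transport. This is only a sketch under stated assumptions: every name in it (`example_btl_t`, the `EXAMPLE_BTL_ATTR_*` bits, `example_btl_has`, `example_btl_select`) is hypothetical and is not taken from the Open MPI code base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits; NOT part of the real Open MPI BTL interface. */
#define EXAMPLE_BTL_ATTR_SELF_BOOTSTRAP (UINT32_C(1) << 0) /* can wire up without a run-time */
#define EXAMPLE_BTL_ATTR_FORK_SAFE      (UINT32_C(1) << 1) /* usable by forked children */

/* Hypothetical, stripped-down stand-in for a transport descriptor. */
typedef struct {
    const char *name;
    uint32_t attrs;
} example_btl_t;

/* Return nonzero if the transport advertises every bit in `required`. */
static int example_btl_has(const example_btl_t *btl, uint32_t required)
{
    assert(NULL != btl);
    return (btl->attrs & required) == required;
}

/* Return the first transport that satisfies `required`, or NULL if none does. */
static const example_btl_t *example_btl_select(const example_btl_t *btls,
                                               size_t count, uint32_t required)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        if (example_btl_has(&btls[i], required)) {
            return &btls[i];
        }
    }
    return NULL;
}
```

A caller that runs without a run-time and forks workers would ask for `EXAMPLE_BTL_ATTR_SELF_BOOTSTRAP | EXAMPLE_BTL_ATTR_FORK_SAFE` and fall back to another path when the selector returns NULL. A bitmask keeps the check cheap and lets new attributes be added without changing the descriptor layout.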
Re: [OMPI devel] BTL move - the notion
> I think we all agree that STCI and OMPI have different objectives and requirements. OMPI is facing the need to launch and operate at extreme scales by next summer, has received a lot of interest in having it report errors into various systems, etc. We don't have all the answers as to what will be necessary to meet these requirements, but indications so far are that tighter integration, not deeper abstraction, between the various layers will be needed. By that, I don't mean we will violate abstraction layers, but rather that the various layers need to work more as a tightly tuned instrument, with each layer operating based on a clear knowledge of how the other layers are functioning.

OMPI and STCI are two different things altogether, and I have a vested interest in both, and have no desire to have either go south. You have a set of requirements at LANL which are important, and we also have a set of requirements at ORNL, and as such we need to compromise on these in the code base. We have MPI level goals, which will be accomplished in the OMPI code base, and tools and other related goals that will be accomplished in other code bases. We both have the need to function well at the high end, so have the same set of goals there.

> For example, for modex-less operations, the MPI/BTLs have to know that the RTE/OS will be providing certain information. This means that they don't have to go out and discover it themselves every time. Yes, we will leave that as the default behavior so that small and/or unmanaged clusters can operate, but we have to also introduce logic that can detect when we are utilizing this alternative capability and exploit it. While we are trying our best to avoid introducing RTE-like calls into the code, the fact is that we may well have to do so (we have already identified one btl that will definitely need to). It is simply too early to make the decision to cut that off now - we don't know what the long-term impacts of such a decision will be.

This is where discussions will need to go both ways. Your changes also can impact us, and we need to agree to those changes, just as much as you need to agree with the changes we are proposing. This is not a code base focused on a single institution's requirements, and we all do our best (and I believe tend to succeed) at helping meet all of our needs.

> Finally, although I don't do much on the MPI layer, I am concerned about performance. I would tend to oppose any additional abstraction until we can measure the performance impact. Thus, I would like to see the BTL move done on a tmp branch (technology to branch up to the implementer - I don't care) so we can verify that it isn't hurting us in some unforeseeable manner.

Agreed - at least for the last phase of what we are suggesting, but we can talk about this. I am a bit confused about how the location of the source code has anything to do with how it performs at run-time. At this stage we have said nothing about changing the way the btl works, just cosmetic things. When it comes to enabling the use of stci with ompi, then these issues will come up, and need to be addressed very carefully. To be honest, since we don't want to change the btl's (aside from adding some attributes) I don't expect this to be an issue, UNLESS we end up needing to change some data structures for abstraction purposes. This is where we need to be very careful. If you look at what has happened with the btl's (actually first the PTL's) historically, I have been one of the ones pushing hard for improved performance - why would this change now ?

> So I guess my concerns really boil down to dealing with conflicting schedules and requirements, how to support multiple possibly competing groups that want to share one or more parts of our code base, and retaining an OMPI-first philosophy when it comes to what changes get made. My proposed solution is:

This is the problem we face all the time, and on a regular basis we as a community do our best to help each other out. This is one of the reasons 1.3 is as late as it is, and this is a good thing that will continue as long as this is a community project.

> 1. shift our repository to a technical solution that supports broader code sharing
>
> 2. have the non-OMPI groups access our code base via that technology. They can "pull" changes at will, subject to the licensing agreement. It is true that they may have to do some local editing if the change hits a spot where they have local mods to support their system, but both Hg and Git are very good at handling this - much better than svn ever has been.
>
> 3. if there are minor mods required to make the BTL code area easier to share via the above methods, then we should explore and implement them. Certainly, renaming #define values would seem a no-brainer. I suspect there are other similar things that could be done. Removing orte/opal dependencies is more controversial and would need to thorou
[OMPI devel] orte_default_hostfile
Hi,

In 1.2.x, the rds_hostfile_path parameter pointed to openmpi-default-hostfile by default. This parameter has been replaced with orte_default_hostfile in 1.3, but now it defaults to . Was there any particular reason for the default value to change?

Greg
[OMPI devel] Forwarding SIGTSTP and SIGCONT
We have had requests to be able to suspend/resume MPI jobs within an SGE environment. SGE sends a signal (which is configurable) to mpirun to stop the job and another signal to resume it. To support this, I propose that we add support in the ORTE to catch SIGTSTP/SIGCONT and forward these to the a.outs. Actually, SIGTSTP will be caught, forwarded, then converted to SIGSTOP before being delivered to the a.outs. The one disadvantage is that we have overridden the SIGTSTP default behavior which is typically to stop mpirun.

Does anyone else have a requirement like this or does anyone have issues with these changes? FWIW, I know there is at least one other MPI that supports this type of behavior.

One problem is that with SIGTSTP no longer delivering a stop signal to mpirun, one cannot CTRL-Z at their terminal to stop mpirun. I am trying to figure out how big a problem that is.

Rolf

PS: Here are the possible code changes. Not too major.

burl-ct-v440-2 62 =>svn diff
Index: orte/tools/orterun/orterun.c
===================================================================
--- orte/tools/orterun/orterun.c	(revision 20072)
+++ orte/tools/orterun/orterun.c	(working copy)
@@ -99,6 +99,8 @@
 #ifndef __WINDOWS__
 static struct opal_event sigusr1_handler;
 static struct opal_event sigusr2_handler;
+static struct opal_event sigtstp_handler;
+static struct opal_event sigcont_handler;
 #endif  /* __WINDOWS__ */
 static orte_job_t *jdata;
 static char *orterun_basename = NULL;
@@ -511,6 +513,12 @@
     opal_signal_set(&sigusr2_handler, SIGUSR2,
                     signal_forward_callback, &sigusr2_handler);
     opal_signal_add(&sigusr2_handler, NULL);
+    opal_signal_set(&sigtstp_handler, SIGTSTP,
+                    signal_forward_callback, &sigtstp_handler);
+    opal_signal_add(&sigtstp_handler, NULL);
+    opal_signal_set(&sigcont_handler, SIGCONT,
+                    signal_forward_callback, &sigcont_handler);
+    opal_signal_add(&sigcont_handler, NULL);
 #endif  /* __WINDOWS__ */
 
     /* we are an hnp, so update the contact info field for later use */
@@ -763,6 +771,8 @@
     /** Remove the USR signal handlers */
     opal_signal_del(&sigusr1_handler);
     opal_signal_del(&sigusr2_handler);
+    opal_signal_del(&sigtstp_handler);
+    opal_signal_del(&sigcont_handler);
 #endif  /* __WINDOWS__ */
 
     /* get the daemon job object */
Index: orte/orted/orted_comm.c
===================================================================
--- orte/orted/orted_comm.c	(revision 20072)
+++ orte/orted/orted_comm.c	(working copy)
@@ -457,10 +457,6 @@
 
             /****    SIGNAL_LOCAL_PROCS   ****/
         case ORTE_DAEMON_SIGNAL_LOCAL_PROCS:
-            if (orte_debug_daemons_flag) {
-                opal_output(0, "%s orted_cmd: received signal_local_procs",
-                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
-            }
             /* unpack the jobid */
             n = 1;
             if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &job, &n, ORTE_JOBID))) {
@@ -474,7 +470,22 @@
                 ORTE_ERROR_LOG(ret);
                 goto CLEANUP;
             }
-
+
+            /* Convert SIGTSTP to SIGSTOP so we can suspend a.out */
+            if (SIGTSTP == signal) {
+                if (orte_debug_daemons_flag) {
+                    opal_output(0, "%s orted_cmd: converted SIGTSTP to SIGSTOP before delivering",
+                                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
+                }
+                signal = SIGSTOP;
+            }
+
+            if (orte_debug_daemons_flag) {
+                opal_output(0, "%s orted_cmd: received signal_local_procs, delivering signal %d",
+                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                            signal);
+            }
+
             /* signal them */
             if (ORTE_SUCCESS != (ret = orte_odls.signal_local_procs(NULL, signal))) {
                 ORTE_ERROR_LOG(ret);
burl-ct-v440-2 63 =>

--
=
rolf.vandeva...@sun.com
781-442-3043
=
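The forwarding rule in the patch above can be tried outside of ORTE with plain POSIX calls. The sketch below is a standalone illustration, not ORTE code: `translate_signal`, `spawn_idle_child`, and `forward_signal` are hypothetical helper names, and the sleeping child stands in for an a.out. It maps SIGTSTP to SIGSTOP before delivery (SIGSTOP cannot be caught or ignored, so the child is suspended regardless of its own handlers) and passes every other signal through unchanged.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a signal caught by the launcher to the one delivered to the child:
 * SIGTSTP becomes SIGSTOP, everything else passes through unchanged. */
static int translate_signal(int sig)
{
    return (SIGTSTP == sig) ? SIGSTOP : sig;
}

/* Fork a child that only sleeps (a stand-in for an a.out).
 * Returns the child's pid in the parent, or -1 if fork() failed. */
static pid_t spawn_idle_child(void)
{
    pid_t pid = fork();
    if (0 == pid) {
        for (;;) {
            pause();
        }
    }
    return pid;
}

/* Deliver the translated signal and wait for the child's state change.
 * Returns the status reported by waitpid(), or -1 on error. */
static int forward_signal(pid_t child, int sig)
{
    int status = 0;
    assert(child > 0);
    if (0 != kill(child, translate_signal(sig))) {
        return -1;
    }
    if (child != waitpid(child, &status, WUNTRACED | WCONTINUED)) {
        return -1;
    }
    return status;
}
```

A launcher would call `forward_signal(child, SIGTSTP)` from its own SIGTSTP handling path and `forward_signal(child, SIGCONT)` to resume. The returned status can be checked with `WIFSTOPPED` and `WIFCONTINUED` to confirm that the child actually changed state.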
[OMPI devel] Open MPI v1.3rc2 has been posted
Hi All,

The second release candidate of Open MPI v1.3 is now available:

http://www.open-mpi.org/software/ompi/v1.3/

Please run it through its paces as best you can.

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org