Re: [OMPI devel] SM init failures
On Mar 30, 2009, at 12:05 PM, Jeff Squyres wrote:

> But don't we need the whole area to be zero filled?

It will be zero-filled on demand using the lseek/touch method. However, the OS may not reserve space for the skipped pages or disk blocks, so one could still get out-of-memory or file-system-full errors at arbitrary points. Presumably one could also get segfaults from an mmap'ed segment whose pages couldn't be allocated when the demand came.

Iain
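P.S. In case a concrete example helps, here is a minimal standalone sketch of the lseek/touch method. The backing-file path and segment size are made up, and the real sm code is of course more elaborate:

    /* Sketch of the lseek/"touch" method: extend a file to SIZE bytes
     * without writing them all.  The skipped range reads back as zeros,
     * but the filesystem may not reserve blocks for it, so a later
     * page-in can still fail with ENOSPC (or SIGBUS through mmap). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SIZE (64 * 1024 * 1024)   /* made-up segment size */

    int main(void)
    {
        int fd = open("/tmp/sm_backing_file", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Seek to the last byte and write one byte: everything before
         * it becomes a hole that reads back as zero. */
        if (lseek(fd, SIZE - 1, SEEK_SET) < 0) { perror("lseek"); return 1; }
        if (write(fd, "", 1) != 1) { perror("write"); return 1; }

        void *base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (MAP_FAILED == base) { perror("mmap"); return 1; }

        /* Pages are zero-filled on demand as they are first touched. */
        ((char *) base)[0] = 1;
        ((char *) base)[SIZE / 2] = 1;

        munmap(base, SIZE);
        close(fd);
        unlink("/tmp/sm_backing_file");
        return 0;
    }

Running it and then checking with "ls -ls" should show an apparent file size much larger than the blocks actually allocated, which is exactly why the later allocation failures can happen.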
Re: [OMPI devel] SM init failures
On Mar 31, 2009, at 11:00 AM, Jeff Squyres wrote:

> On Mar 31, 2009, at 3:45 AM, Sylvain Jeaugey wrote:
>
>> Sorry to continue off-topic but going to System V shm would be for me like going back in the past. System V shared memory used to be the main way to do shared memory on MPICH and from my (little) experience, this was truly painful:
>> - Cleanup issues: does shmctl(IPC_RMID) solve _all_ cases? (even kill -9?)
>> - Naming issues: shm segments identified by 32-bit keys, potentially causing conflicts between applications or layers of the same application on one node
>> - Space issues: the total shm size on a system is bound by /proc/sys/kernel/shmmax, needing admin configuration and causing conflicts between MPI applications running on the same node
>
> Indeed. The one saving grace here is that the cleanup issues apparently can be solved on Linux with a special flag that indicates "automatically remove this shmem when all processes attaching to it have died." That was really the impetus for [re-]investigating sysv shm. I, too, remember the sysv pain because we used it in LAM, too...

What about the other issues? I remember those being a PITA about 15 to 20 years ago, but obviously a lot could have improved in the meantime.

Iain
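P.S. If I understand it, the "remove when everyone has died" behavior amounts to marking the segment for destruction while still attached, so the kernel reclaims it once the last attached process exits (even via kill -9). A rough standalone sketch, with a made-up segment size, using IPC_PRIVATE (which also sidesteps the 32-bit key collision issue by passing the id around out of band):

    /* Sketch of the System V shm pattern under discussion: create,
     * attach, then immediately mark for removal.  The segment stays
     * usable until the last attached process detaches or exits. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        size_t size = 4096;   /* made-up segment size */
        int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        void *base = shmat(id, NULL, 0);
        if ((void *) -1 == base) { perror("shmat"); return 1; }

        /* Mark for destruction now; cleanup then happens automatically
         * when the attach count drops to zero. */
        if (shmctl(id, IPC_RMID, NULL) < 0) { perror("shmctl"); return 1; }

        strcpy((char *) base, "hello from sysv shm");
        printf("%s\n", (char *) base);

        shmdt(base);
        return 0;
    }

That still leaves the shmmax sizing issue, of course, which no amount of API cleverness fixes.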
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r20926
On Apr 1, 2009, at 4:29 PM, Jeff Squyres wrote:

> Should the same fixes be applied to type_create_keyval_f.c and win_create_keyval_f.c?

Good question. I'll have a look at them.

Iain
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r20926
On Apr 1, 2009, at 4:58 PM, Iain Bason wrote:

> On Apr 1, 2009, at 4:29 PM, Jeff Squyres wrote:
>
>> Should the same fixes be applied to type_create_keyval_f.c and win_create_keyval_f.c?
>
> Good question. I'll have a look at them.

It looks as though those have the same problem. I will write a test to make sure.

Iain
Re: [OMPI devel] opal / fortran / Flogical
On Jun 2, 2009, at 10:24 AM, Rainer Keller wrote:

> no, that's not an issue. The comment is correct: For any Fortran integer*kind we need to have _some_ C-representation as well, otherwise we disregard the type (tm), see e.g. the old and resolved ticket #1094. The representation chosen is set in opal/util/arch.c and is conclusive as far as I can tell...

Doesn't that mean that the comment is misleading? I interpret it as saying that a Fortran "default integer" is always the same as a C "int". I believe that you are saying that it really means that *any* kind of Fortran integer must be the same as one of C's integral types, or OpenMPI won't support it at all. Shouldn't the comment be clearer?

Iain
Re: [OMPI devel] opal / fortran / Flogical
On Jun 3, 2009, at 1:30 PM, Ralph Castain wrote:

> I'm not entirely sure what comment is being discussed.

Jeff said:

> I see the following comment:
>
>     ** The fortran integer is dismissed here, since there is no
>     ** platform known to me, were fortran and C-integer do not match
>
> You can tell the intel compiler (and maybe others?) to compile fortran with double-sized integers and reals. Are we disregarding this? I.e., does this make this portion of the datatype heterogeneity detection incorrect?

Rainer said:

> no, that's not an issue. The comment is correct: For any Fortran integer*kind we need to have _some_ C-representation as well, otherwise we disregard the type (tm), see e.g. the old and resolved ticket #1094.

I said:

> Doesn't that mean that the comment is misleading? I interpret it as saying that a Fortran "default integer" is always the same as a C "int". I believe that you are saying that it really means that *any* kind of Fortran integer must be the same as one of C's integral types, or OpenMPI won't support it at all. Shouldn't the comment be clearer?

I believe that you are talking about a different comment:

    * RHC: technically, use of the ompi_ prefix is
    * an abstraction violation. However, this is actually
    * an error in our configure scripts that transcends
    * all the data types and eventually should be fixed.
    * The guilty part is f77_check.m4. Fixing it right
    * now is beyond a reasonable scope - this comment is
    * placed here to explain the abstraction break and
    * indicate that it will eventually be fixed

I don't know whether anyone is using either of these comments to justify anything.

Iain
Re: [OMPI devel] MPI_REAL16
Jeff Squyres wrote:

> Thanks for looking into this, David. So if I understand that correctly, it means you have to assign all literals in your fortran program with a "_16" suffix. I don't know if that's standard Fortran or not.

Yes, it is.

Iain
Re: [OMPI devel] [OMPI svn] svn:open-mpi r21480
Ralph Castain wrote:

> I'm sorry, but this change is incorrect. If you look in orte/mca/ess/base/ess_base_std_orted.c, you will see that -all- orteds, regardless of how they are launched, open and select the PLM.

I believe you are mistaken. Look in plm_base_launch_support.c:

    /* The daemon will attempt to open the PLM on the remote
     * end. Only a few environments allow this, so the daemon
     * only opens the PLM -if- it is specifically told to do
     * so by giving it a specific PLM module. To ensure we avoid
     * confusion, do not include any directives here */
    if (0 == strcmp(orted_cmd_line[i+1], "plm")) {
        continue;
    }

That code strips out anything like "-mca plm rsh" from the command line passed to a remote daemon. Meanwhile, over in ess_base_std_orted.c:

    /* some environments allow remote launches - e.g., ssh - so
     * open the PLM and select something -only- if we are given
     * a specific module to use */
    mca_base_param_reg_string_name("plm", NULL,
                                   "Which plm component to use (empty = none)",
                                   false, false, NULL, &plm_to_use);

    if (NULL == plm_to_use) {
        plm_in_use = false;
    } else {
        plm_in_use = true;

        if (ORTE_SUCCESS != (ret = orte_plm_base_open())) {
            ORTE_ERROR_LOG(ret);
            error = "orte_plm_base_open";
            goto error;
        }

        if (ORTE_SUCCESS != (ret = orte_plm_base_select())) {
            ORTE_ERROR_LOG(ret);
            error = "orte_plm_base_select";
            goto error;
        }
    }

So a PLM is loaded only if specified with "-mca plm foo", but that -mca flag is stripped out when launching the remote daemon.

I also ran into this issue with tree spawning. (I didn't putback a fix because I couldn't get tree spawning actually to improve performance. My fix was not to strip out the "-mca plm foo" parameters if tree spawning had been requested.)

Iain
Re: [OMPI devel] [OMPI svn] svn:open-mpi r21480
Ralph Castain wrote:

> Yes, but look at orte/mca/plm/rsh/plm_rsh_module.c:
>
>     /* ensure that only the ssh plm is selected on the remote daemon */
>     var = mca_base_param_environ_variable("plm", NULL, NULL);
>     opal_setenv(var, "rsh", true, &env);
>     free(var);
>
> This is done in "ssh_child", right before we fork_exec the ssh command to launch the remote daemon. This is why slave spawn works, for example.

My ssh does not preserve environment variables:

    bash-3.2$ export MY_VERY_OWN_ENVIRONMENT_VARIABLE=yes
    bash-3.2$ ssh cubbie env | grep MY_VERY_OWN
    WARNING: This is a restricted access server. If you do not have explicit
    permission to access this server, please disconnect immediately.
    Unauthorized access to this system is considered gross misconduct and may
    result in disciplinary action, including revocation of SWAN access
    privileges, immediate termination of employment, and/or prosecution to
    the fullest extent of the law.
    bash-3.2$

The rsh man page explicitly states that the local environment is not passed to the remote shell. I haven't checked qrsh. Maybe it works with that.

> I agree that tree_spawn doesn't seem to work right now, but it is not due to the plm not being selected.

It was for me. I don't know whether it is because your rsh/ssh work differently, or for some other reason, but there is no question that my tree spawn failed because no PLM was loaded.

> There are other factors involved.

The other factors that I came across were:

* I didn't have my .ssh/config file set up to forward authentication. I added a -A flag to the ssh command in plm_base_rsh_support.

* In plm_rsh_module.c:setup_launch, a NULL orted_cmd made asprintf crash. I used (orted_cmd == NULL ? "" : orted_cmd) in the call to asprintf.

Once I fixed those, tree spawning worked for me. (I believe you mentioned a race condition in another conversation. I haven't run into that.)

Iain
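P.S. A tiny standalone illustration of the second fix; the variable names here are mine, not the exact ones in setup_launch:

    /* Passing a NULL argument through %s crashes asprintf on some
     * platforms, so substitute "" when no extra orted args were given. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *orted_cmd = NULL;   /* the case that crashed for me */
        char *line = NULL;

        if (asprintf(&line, "orted %s",
                     (NULL == orted_cmd) ? "" : orted_cmd) < 0) {
            return 1;
        }
        printf("launch line: '%s'\n", line);
        free(line);
        return 0;
    }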
Re: [OMPI devel] MPI_REAL16
(Thanks, Nick, for explaining that kind values are compiler-dependent. I was too lazy to do that.)

Jeff Squyres wrote:

> Given that I'll inevitably get the language wrong, can someone suggest proper verbiage for this statement in the OMPI README:
>
> - MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a portable C datatype can be found that matches the Fortran type REAL*16, both in size and bit representation. The Intel v11 compiler, for example, supports these types, but requires the use of the "_16" suffix in Fortran when assigning constants to REAL*16 variables.

The _16 suffix really has nothing to do with whether there is a C datatype that corresponds to REAL*16. There are two separate issues here:

1. In Fortran code, any floating point literal has the default kind unless otherwise specified. That means that you can get surprising results from a simple program designed to test whether a C compiler has a data type that corresponds to REAL*16: the least significant bits of a REAL*16 variable will be set to zero when the literal is assigned to it.

2. Open MPI requires the C compiler to have a data type that has the same bit representation as the Fortran compiler's REAL*16. If the C compiler does not have such a data type, then Open MPI cannot support REAL*16 in its Fortran interface. My understanding is that the Intel representative said that there is some compiler switch that allows the C compiler to have such a data type. I didn't pay enough attention to see whether there was some reason not to use the switch.

She also pointed out a bug in the Fortran test code that checks for the presence of the C data type. She suggested using a _16 suffix on a literal in that test code. Nick pointed out that that _16 suffix means, "make this literal a KIND=16 literal," which may mean different things to different compilers. In particular, REAL*16 may not be the same as REAL(KIND=16). However, there is no standard way to specify, "make this literal a REAL*16 literal." That means that you have to do one of:

* Declare the variable REAL(KIND=16) and use the _16 suffix on the literal.

* Define some parameter QUAD using the SELECTED_REAL_KIND intrinsic, declare the variable REAL(KIND=QUAD), and use the _QUAD suffix on the literal.

* Assume that REAL*16 is the same as REAL(KIND=16) and use the _16 suffix on the literal.

That assumption turns out to be safer than one might imagine. It is certainly true for the Sun and Intel compilers. I am pretty sure it is true for the PGI, Pathscale, and GNU compilers. I am not aware of any compilers for which it is not true, but that doesn't mean there is no such compiler.

All of which is a long-winded way of saying that maybe the README ought to just say:

    MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a
    portable C datatype can be found that matches the Fortran type REAL*16,
    both in size and bit representation.

Iain
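P.S. Purely for illustration (this is not the actual configure test), here is the C side of the question in miniature: print the size and byte pattern of a long double so they can be compared against what the Fortran compiler stores for the corresponding REAL*16 literal. Matching on size alone isn't enough; the bit representations have to agree too.

    /* Dump the storage of a long double constant.  On x86 this is often
     * 80-bit extended precision padded to 12 or 16 bytes, which has the
     * right size but not the right representation for REAL*16. */
    #include <stdio.h>

    int main(void)
    {
        long double x = 1.1L;
        unsigned char *p = (unsigned char *) &x;
        size_t i;

        printf("sizeof(long double) = %zu, bytes:", sizeof(long double));
        for (i = 0; i < sizeof(long double); ++i) {
            printf(" %02x", p[i]);
        }
        printf("\n");
        return 0;
    }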
Re: [OMPI devel] [OMPI svn] svn:open-mpi r21504
On Jun 23, 2009, at 7:17 PM, Ralph Castain wrote:

> Not any more, when using regex - the only message that comes back is one/node telling the HNP that the procs have been launched. These messages flow along the route, not direct to the HNP - assuming you use the static port option.

Is there any prospect of doing the same without requiring the static port option?

Iain
Re: [OMPI devel] [OMPI svn] svn:open-mpi r21504
On Jun 25, 2009, at 11:10 AM, Ralph Castain wrote:

> They do flow along the route at all times. However, without static ports the orted has to start by directly connecting to the HNP and sending the orted's contact info to the HNP.

This is the part I don't understand. Why can't they send the contact info along the route as well? Don't they have enough information to wire a route to the HNP? If not, can't they be given it at startup?

> Then the HNP includes that info in the launch msg, allowing the orteds to wireup their routes. So the difference is that the static ports allow us to avoid that initial HNP-direct connection, which is what causes the flood.

I should warn everyone that in my experiments the HNP flood is not the only problem with tree spawning. In fact, it doesn't even seem to be the limiting problem. At the moment, it appears that the limiting problem on my cluster has to do with sshd/rshd accessing some name service (e.g., gethostbyname, getpwnam, getdefaultproject, or something like that). I am hoping to find that this is just some cluster configuration oddity. YMMV, of course.

> The other thing that hasn't been done yet is to have the "procs-launched" messages rollup in the collective - the HNP gets one/daemon right now, even though it comes down the routed path. Hope to have that done next week. That will be in operation regardless of static vs non-static ports.

Great!

Iain
Re: [OMPI devel] [OMPI svn] svn:open-mpi r21723
On Jul 21, 2009, at 6:31 PM, Ralph Castain wrote:

> This commit appears to have broken the build system for Mac OSX. Could you please fix it - since it only supports Solaris and Linux, how about setting it so it continues to work in other environments??

That was the intent of the configure.m4 script in that directory. It is supposed to check for the existence of some files in /proc, which should not exist on a Mac. Could you send me the relevant portion of the config.log on Mac OSX?

Iain
Re: [OMPI devel] autodetect broken
On Jul 21, 2009, at 6:34 PM, Jeff Squyres wrote:

> I'm quite confused about what this component did to the base functions. I haven't had a chance to digest it properly, but it "feels wrong"... Iain -- can you please explain the workings of this component and its interactions with the base?

The autodetect component gets loaded after the environment component, and before the config component. So environment variables like OPAL_PREFIX will override it. When it loads, it finds the directory containing libopen-pal.so (assuming that is where the autodetect component actually is) and sets its install_dirs_data.libdir to that. The other fields of install_dirs_data are set to "${infer-libdir}". So when the base component loads autodetect, and no environment variables have set any of the fields, every field of opal_install_dirs except libdir is set to "${infer-libdir}". (If the autodetect component is statically linked into an application, then it will set bindir rather than libdir.)

The base component looks for fields set to "${infer-foo}", and calls opal_install_dirs_infer to figure out what the field should be. For example, if opal_install_dirs.prefix is set to "${infer-libdir}", then it calls opal_install_dirs_infer("prefix", "libdir}", 6, &component->install_dirs_data).

Opal_install_dirs_infer expands everything in component->install_dirs_data.libdir *except* "${prefix}". Let's say that ompi was configured so that libdir is "${prefix}/lib", and the actual path to libopen-pal.so is /usr/local/lib/libopen-pal.so. The autodetect component will have set opal_install_dirs.libdir to "/usr/local/lib". It matches the tail of "${prefix}/lib" to "/usr/local/lib", and infers that the remainder must be the prefix, so it sets opal_install_dirs.prefix to "/usr/local".

Other directories (e.g., pkgdatadir) presumably cannot be inferred from libdir, and opal_install_dirs_infer will return NULL. The config component will then load some value into that field, and things will work as they did before.

Iain
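P.S. To make the inference step concrete, here is a small standalone sketch of the idea. It is not the actual opal_install_dirs_infer code, and the paths are made up:

    /* Given the config component's pattern for libdir (e.g.
     * "${prefix}/lib") and the directory we actually found
     * libopen-pal.so in (e.g. "/usr/local/lib"), strip the matching
     * tail; whatever is left must be the prefix. */
    #include <stdio.h>
    #include <string.h>

    /* Returns a pointer into a static buffer holding the inferred
     * prefix, or NULL if the pattern's tail doesn't match. */
    static const char *infer_prefix(const char *pattern, const char *detected)
    {
        static char prefix[4096];
        const char *tail = strstr(pattern, "${prefix}");
        size_t tail_len, det_len;

        if (NULL == tail) {
            return NULL;                 /* pattern doesn't use ${prefix} */
        }
        tail += strlen("${prefix}");     /* e.g. "/lib" */
        tail_len = strlen(tail);
        det_len = strlen(detected);

        if (det_len < tail_len ||
            0 != strcmp(detected + det_len - tail_len, tail)) {
            return NULL;                 /* tails don't match; can't infer */
        }
        memcpy(prefix, detected, det_len - tail_len);
        prefix[det_len - tail_len] = '\0';
        return prefix;
    }

    int main(void)
    {
        /* "/usr/local/lib" is where we (hypothetically) found libopen-pal.so */
        const char *p = infer_prefix("${prefix}/lib", "/usr/local/lib");
        printf("inferred prefix: %s\n", p ? p : "(none)");

        /* Mismatched tail, e.g. configure was given --libdir=/path/two */
        p = infer_prefix("${prefix}/lib", "/path/two");
        printf("inferred prefix: %s\n", p ? p : "(none)");
        return 0;
    }

The second call shows why the config component's pattern matters: when the tail doesn't match, the only safe answer is "can't infer".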
Re: [OMPI devel] autodetect broken
On Jul 21, 2009, at 6:34 PM, Jeff Squyres wrote:

> Also, it seems broken:
>
>     [15:31] svbu-mpi:~/svn/ompi4 % ompi_info | grep installd
>     --
>     Sorry! You were supposed to get help about:
>         developer warning: field too long
>     But I couldn't open the help file:
>         /${datadir}/openmpi/help-ompi_info.txt: No such file or directory. Sorry!
>     --
>         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4)
>         MCA installdirs: autodetect (MCA v2.0, API v2.0, Component v1.4)
>         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4)
>     [15:31] svbu-mpi:~/svn/ompi4 %
>
> The help file should have been found. This is on Linux RHEL4, but I doubt it's a Linux-version-specific issue...

Could you send me your configure options, and your OPAL_XXX environment variables?

Iain
Re: [OMPI devel] [OMPI svn] svn:open-mpi r21723
On Jul 21, 2009, at 7:35 PM, Jeff Squyres wrote:

> However, I see that autodetect configure.m4 is checking $backtrace_installdirs_happy -- which seems like a no-no. The ordering of framework / component configure.m4 scripts is not guaranteed, so it's not a good idea to check the output of another configure.m4's macro.

Grrr, I thought I had changed all those to findpc_happy. Well, that's easy enough to fix. I don't see how it could result in the component being built when it isn't supposed to be, though.

Iain
Re: [OMPI devel] autodetect broken
On Jul 21, 2009, at 7:50 PM, Jeff Squyres wrote:

> If you have an immediate fix for this, that would be great -- otherwise, please back this commit out (I said in my previous mail that I would back it out, but I had assumed that you were gone for the day. If you're around, please make the call...).

I am effectively gone for the day. (I am managing to send the odd email between my kids interrupting me.) Please do back out. I'll be able to look at fixing it tomorrow.

Iain
Re: [OMPI devel] autodetect broken
On Jul 21, 2009, at 7:50 PM, Jeff Squyres wrote:

> On Jul 21, 2009, at 7:46 PM, Iain Bason wrote:
>
>>> The help file should have been found. This is on Linux RHEL4, but I doubt it's a Linux-version-specific issue...
>>
>> Could you send me your configure options, and your OPAL_XXX environment variables?
>
> $ ./configure --prefix=/home/jsquyres/bogus --disable-mpi-f77 --enable-mpirun-prefix-by-default
>
> No OPAL_* env variables set. Same thing happens on OS X and Linux.

And does it fail when actually installed in /home/jsquyres/bogus, or only when installed elsewhere?

Iain
Re: [OMPI devel] autodetect broken
On Jul 21, 2009, at 7:48 PM, Jeff Squyres wrote:

> Arrgh!! Even with .ompi_ignore, everything is broken on OS X and Linux (perhaps this is what Ralph was referring to -- not a compile time problem?):
>
>     $ mpicc -g -Isrc -c -o libmpitest.o libmpitest.c
>     Cannot open configuration file ${datadir}/openmpi/mpicc-wrapper-data.txt
>     Error parsing data file mpicc: Not found

Is this just mpicc, or is it also ompi_info and mpirun failing like this? I presume the autodetect component is *not* involved, right? So this presumably is a problem with opal_install_dirs_expand?

Iain
Re: [OMPI devel] autodetect broken
On Jul 22, 2009, at 10:55 AM, Brian W. Barrett wrote:

> The current autodetect implementation seems like the wrong approach to me. I'm rather unhappy the base functionality was hacked up like it was without any advanced notice or questions about original design intent. We seem to have a set of base functions which are now more unreadable than before, overly complex, and which leak memory.

First, I should apologize for my procedural misstep. I had asked my group here at Sun whether I should do an RFC or something, and I guess I didn't make my question clear enough. I was under the impression that I should check something in and let people comment on it.

That being said, I would argue that there are good reasons for adding some complexity to the base component:

1. The pre-existing implementation of expansion is broken (although the cases for which it is broken are somewhat obscure).

2. The autodetect component cannot set more than one directory without some knowledge of the relationships amongst the various directories. Giving it that knowledge would violate the independence of the components.

You can see #1 by doing "OPAL_PREFIX='${datarootdir}/..' OPAL_DATAROOTDIR='/opt/SUNWhpc/HPC8.2/share' mpirun hostname" (if you have installed in /opt/SUNWhpc/HPC8.2). Yes, I know, "Why would anyone do that?" Nonetheless, I find that a poor excuse for having a bug in the code.

To expand on #2 a little, consider the case where OpenMPI is configured with "--prefix=/path/one --libdir=/path/two". We can tell that libopen-pal is in /path/two, but it is not correct to assume that the prefix is therefore /path. (Unfortunately, there is code in OpenMPI that does make that sort of assumption -- see orterun.c.) We need information from the config component to avoid making incorrect assumptions.

There are, of course, alternate ways of getting to the same point. But it is not feasible simply to leave the design of the base component unchanged. (More on that below.)

As for readability, I am always open to constructive suggestions as to how to make code more readable. I didn't fix the memory leak because I hadn't yet found a way to do that without decreasing readability.

> The intent of the installdirs framework was to allow this type of behavior, but without rehacking all this infer crap into base. The autodetect component should just set $prefix in the set of functions it returns (and possibly libdir and bindir if you really want, which might actually make sense if you guess wrong), and let the expansion code take over from there. The general thought on how this would work went something like:
>
> - Run after config
> - If determine you have a special $prefix, set the opal_install_dirs.prefix to NULL (yes, it's a bit of a hack) and set your special prefix.
> - Same with bindir and libdir if needed
> - Let expansion (which runs after all components have had the chance to fill in their fields) expand out with your special data

If we run the autodetect component after config, and allow it to override values that are already in opal_install_dirs, then there will be no way for users to have environment variables take precedence. (Let's say someone runs with OPAL_LIBDIR=/foo. The autodetect component will not know whether opal_install_dirs.libdir has been set by the env component or by the config component.)

Moreover, if the user has used an environment variable to override one of the paths in the config component, then the autodetect component may make the wrong inference. For example, let's say someone runs with OPAL_LIBDIR=/foo. The autodetect component finds libopen-pal in /usr/renamed/lib, and sets opal_install_dirs.libdir to /usr/renamed/lib. However, it has to use the config component's idea of libdir (e.g., ${exec_prefix}/lib) to correctly infer that prefix should be /usr/renamed. Since it only has /foo from the environment variable, it will have to decide that it cannot infer the prefix.

All of this will lead to behavior that users will have trouble diagnosing. While I appreciate simple code, I think that a simple user interface is more important. We could add some infrastructure so that the autodetect component can figure out the provenance of each field in opal_install_dirs, but that would make the boundary between the base component and the autodetect component unclear.

> And the base stays simple, the components do all the heavy lifting, and life is happy.

Except in the cases where it doesn't work.

> I would not be opposed to putting in a "find expanded part" type function that takes two strings like "${prefix}/lib" and "/usr/local/lib" and returns "/usr/local" being added to the base so that other autodetect-style components don't need to handle such a case, but that's about the extent of the base changes I think are appropriate.
>
> Finally, a first quick code review rev
Re: [OMPI devel] Shared library versioning
On Jul 23, 2009, at 5:53 PM, Jeff Squyres wrote:

> We have talked many times about doing proper versioning for OMPI's .so libraries (e.g., libmpi.so -- *not* our component DSOs).

Forgive me if this has been hashed out, but won't you run into trouble by not versioning the components? What happens when there are multiple versions of libmpi installed? The user program will pick up the correct one because of versioning, but how will libmpi pick up the correct versions of the components?

Iain
[OMPI devel] RFC: Suspend/resume enhancements
WHAT: Enhance the orte_forward_job_control MCA flag by:

1. Forwarding signals to descendants of launched processes; and
2. Forwarding signals received before process launch time.

(The orte_forward_job_control flag arranges for SIGTSTP and SIGCONT to be forwarded. This allows a resource manager like Sun Grid Engine to suspend a job by sending a SIGTSTP signal to mpirun.)

WHY: Some programs do "mpirun prog.sh", and prog.sh starts multiple processes. Among these programs is weather prediction code from the UK Met Office. This code is used at multiple sites around the world. Since other MPI implementations* forward job control signals this way, we risk having OMPI excluded unless we implement this feature.

[*I have personally verified that Intel MPI does it. I have heard that Scali does it. I don't know about the others.]

HOW: To allow signals to be sent to descendants of launched processes, use the setpgrp() system call to create a new process group for each launched process. Then send the signal to the process group rather than to the process.

To allow signals received before process launch time to be delivered when the processes are launched, add a job state flag to indicate whether the job is suspended. Check this flag at launch time, and send a signal immediately after launching.

WHERE: http://bitbucket.org/igb/ompi-job-control/

WHEN: We would like to integrate this into the 1.5 branch.

TIMEOUT: COB Tuesday, January 19, 2010.

Q&A:

1. Will this work for Windows?

I don't know what would be required to make this work for Windows. The current implementation is for Unix only.

2. Will this work for interactive ssh/rsh PLM?

It will not work any better or worse than the current implementation. One can suspend a job by typing Ctrl-Z at a terminal, but the mpirun process itself never gets suspended. That means that in order to wake the job up one has to open a different terminal to send a SIGCONT to the mpirun process. It would be desirable to fix this problem, but as this feature is intended for use with resource managers like SGE it isn't essential to make it work smoothly in an interactive shell.

3. Will the creation of new process groups prohibit SGE from killing a job properly?

No. SGE has a mechanism to ensure that all a job's processes are killed, regardless of whether they create new process groups.

4. What about other resource managers?

Using this flag with another resource manager might cause problems. However, the flag may not be necessary with other resource managers. (If the RM can send SIGSTOP to all the processes on all the nodes running a job, then mpirun doesn't need to forward job control signals.) According to the SLURM documentation, plugins are available (e.g., linuxproc) that would allow reliable termination of all a job's processes, regardless of whether they create new process groups. [https://computing.llnl.gov/linux/slurm/proctrack_plugins.html]

5. Will the creation of new process groups prevent mpirun from shutting down the job successfully (e.g., when it receives a SIGTERM)?

No. I have tested jobs both with and without calls to MPI_Comm_spawn, and all are properly terminated.

6. Can we avoid creating new process groups by just signaling the launched process plus any process that calls MPI_Init?

No. The shell script might launch other background processes that the user wants to suspend. (The Met Office code does this.)

7. Can we avoid creating new process groups by having mpirun and orted send SIGTSTP to their own process groups, and ignore the signal that they send to themselves?

No. First, mpirun might be in the same process group as other mpirun processes. Those mpiruns could get into an infinite loop forwarding SIGTSTPs to one another. Second, although the default action on receipt of SIGTSTP is to suspend the process, that only happens if the process is not in an orphaned process group. SGE starts processes in orphaned process groups.
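P.S. A minimal standalone sketch of the mechanism described under HOW, using setpgid(0, 0) in the child (equivalent to the setpgrp() mentioned above) and kill() with a negative pid to signal the whole group. This is not the actual ODLS/orted code; the "sh -c" command stands in for a prog.sh that spawns extra processes:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (0 == pid) {
            /* Child: become the leader of a new process group, then
             * exec a wrapper that spawns more processes. */
            setpgid(0, 0);
            execlp("sh", "sh", "-c", "sleep 30 & exec sleep 30", (char *) NULL);
            _exit(127);
        } else if (pid > 0) {
            /* Set it in the parent too, to avoid racing with the exec. */
            setpgid(pid, pid);

            sleep(1);

            /* Forward job control signals to the whole group, so the
             * grandchild started by the wrapper gets them as well. */
            kill(-pid, SIGTSTP);   /* suspend everything in the group */
            sleep(1);
            kill(-pid, SIGCONT);   /* resume everything in the group */

            kill(-pid, SIGTERM);   /* clean up */
            waitpid(pid, NULL, 0);
        } else {
            perror("fork");
        }
        return 0;
    }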
Re: [OMPI devel] RFC: Suspend/resume enhancements
Having heard no further comments, I plan to integrate this into the trunk on Monday.

Iain

On Jan 5, 2010, at 6:27 AM, Terry Dontje wrote:

> This only happens when the orte_forward_job_control MCA flag is set to 1 and the default is that it is set to 0. Which I believe meets Ralph's criteria below.
>
> --td
>
> Ralph Castain wrote:
>
>> I don't have any issue with this so long as (a) it is -only- active when someone sets a specific MCA param requesting it, and (b) that flag is -not- set by default.
>>
>> On Jan 4, 2010, at 11:50 AM, Iain Bason wrote:
>>
>>> [full text of the RFC, quoted above]
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r22762
On Mar 3, 2010, at 1:24 PM, Jeff Squyres wrote:

> I'm not sure I agree with change #1. I understand in principle why the change was made, but I'm uncomfortable with:
>
> 1. The individual entries now behave like pseudo-regexps rather than strict matching. We used strict matching before this for a reason. If we want to allow regexp-like behavior, then I think we should enable that with special characters -- that's the customary/usual way to do it.

The history of this particular piece of code is that it used to use strncmp. George Bosilca changed it last summer, incidental to a larger change (r21652). The commit comment was not particularly illuminating on this issue, in my opinion:

http://www.open-mpi.org/hg/hgwebdir.cgi/ompi-svn-mirror/rev/bde31d3db7ba

> 2. All other _in|exclude behavior in ompi is strict matching, not prefix matching. I'm uncomfortable with the disparity.

That turns out not to be the case. Look in btl_tcp_proc.c/mca_btl_tcp_retrieve_local_interfaces.

> Additionally, if loopback is now handled properly via change #2, shouldn't the default value for the btl_tcp_if_exclude parameter now be empty?

That's a good question. Enabling the "lo" interface results in intra-node messages being striped across that interface in addition to the others on a system. I don't know what impact that would have, if any.

> Actually -- thinking about this a little more, does opal_net_islocalhost() guarantee to work on peer interfaces?

It looks to see whether the IP address is (v4) 127.0.0.1, or (v6) ::1. I believe that these values are dictated by the relevant RFCs (but I haven't looked to make sure).

Iain
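P.S. For concreteness, here is roughly the check I mean. This is an illustration, not the actual opal_net_islocalhost code, which may handle more cases (e.g. the whole 127.0.0.0/8 block):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /* True if the address is the IPv4 or IPv6 loopback address. */
    static bool is_localhost(const struct sockaddr *sa)
    {
        if (AF_INET == sa->sa_family) {
            const struct sockaddr_in *in = (const struct sockaddr_in *) sa;
            return ntohl(in->sin_addr.s_addr) == INADDR_LOOPBACK;   /* 127.0.0.1 */
        }
        if (AF_INET6 == sa->sa_family) {
            const struct sockaddr_in6 *in6 = (const struct sockaddr_in6 *) sa;
            return IN6_IS_ADDR_LOOPBACK(&in6->sin6_addr);            /* ::1 */
        }
        return false;
    }

    int main(void)
    {
        struct sockaddr_in v4;
        memset(&v4, 0, sizeof(v4));
        v4.sin_family = AF_INET;
        inet_pton(AF_INET, "127.0.0.1", &v4.sin_addr);

        printf("127.0.0.1 -> %s\n",
               is_localhost((struct sockaddr *) &v4) ? "localhost" : "remote");
        return 0;
    }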
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r22762
On Mar 3, 2010, at 3:04 PM, Jeff Squyres wrote:

> Mmmm... good point. I was thinking specifically of the if_in|exclude behavior in the openib BTL. That uses strcmp, not strncmp. Here's a complete list:
>
> ompi_info --param all all --parsable | grep include | grep :value:
> mca:opal:base:param:opal_event_include:value:poll
> mca:btl:ofud:param:btl_ofud_if_include:value:
> mca:btl:openib:param:btl_openib_if_include:value:
> mca:btl:openib:param:btl_openib_ipaddr_include:value:
> mca:btl:openib:param:btl_openib_cpc_include:value:
> mca:btl:sctp:param:btl_sctp_if_include:value:
> mca:btl:tcp:param:btl_tcp_if_include:value:
> mca:btl:base:param:btl_base_include:value:
> mca:oob:tcp:param:oob_tcp_if_include:value:
>
> Do we know what these others do? I only checked openib_if_*clude -- it's strcmp.

I haven't looked at those, but it's easy to grep for strncmp... It looks as though sctp is the only other BTL that uses strncmp. Of course, if we decide to change the default so that it no longer includes "lo" then maybe using strncmp doesn't matter. The problem has been that the name of the interface is different on different platforms. (I should note that the default also excludes "sppp". I don't know anything about that interface.)

>>> Additionally, if loopback is now handled properly via change #2, shouldn't the default value for the btl_tcp_if_exclude parameter now be empty?
>>
>> That's a good question. Enabling the "lo" interface results in intra-node messages being striped across that interface in addition to the others on a system. I don't know what impact that would have, if any.
>
> sm and self should still be prioritized above it, right? If so, we should be ok.

Yes, that's true. It would only affect those who restrict intra-node communication to TCP.

> However, I think you're right that the addition of striping across lo* in addition to the other interfaces might have an unknown effect.
>
> Here's a random question -- if a user does not use the sm btl, would sending messages through lo for on-node communication be potentially better than sending it through a real device, given that that real device may be far away (in the NUMA sense of "far")? I.e., are OS's typically smart enough to know that loopback traffic may be able to stay local to the NUMA node, vs. sending it out to a device and back? Or are OS's smart enough to know that if both ends of a TCP socket are on the same node -- regardless of what IP interface they use -- and if both processes are on the same NUMA locality, that the data can stay local and not have to make a round trip to the device?
>
> (I admit that this is a fairly corner case -- doing on-node communication but *not* using the sm btl...)

Good question. For the loopback interface there is no physical device, so there should be no NUMA effect. For an interface with a physical device there may be some reason that a packet would actually have to go out to the device. If there is no such reason, I would expect Unix to be smart enough not to do it, given how much intra-node TCP traffic one commonly sees on Unix. I couldn't hazard a guess about Windows.

Iain
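P.S. A small illustration of the strcmp vs. strncmp difference we keep circling around; the interface names are just examples ("lo0" being the Solaris/BSD-style spelling that, as far as I know, motivated prefix matching in the first place):

    /* Compare how a single exclude entry ("lo") behaves under prefix
     * matching (strncmp) versus exact matching (strcmp). */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *exclude = "lo";   /* one entry from btl_tcp_if_exclude */
        const char *ifs[] = { "lo", "lo0", "eth0", "e1000g0" };
        size_t i;

        for (i = 0; i < sizeof(ifs) / sizeof(ifs[0]); ++i) {
            int prefix_match = (0 == strncmp(exclude, ifs[i], strlen(exclude)));
            int exact_match  = (0 == strcmp(exclude, ifs[i]));
            printf("%-8s prefix(strncmp): %-8s exact(strcmp): %s\n",
                   ifs[i],
                   prefix_match ? "excluded" : "kept",
                   exact_match ? "excluded" : "kept");
        }
        return 0;
    }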