Re: [OMPI devel] SM init failures

2009-03-30 Thread Iain Bason


On Mar 30, 2009, at 12:05 PM, Jeff Squyres wrote:


But don't we need the whole area to be zero filled?


It will be zero-filled on demand using the lseek/touch method.   
However, the OS may not reserve space for the skipped pages or disk  
blocks.  Thus one could still get out of memory or file system full  
errors at arbitrary points.  Presumably one could also get segfaults  
from an mmap'ed segment whose pages couldn't be allocated when the  
demand came.
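
For concreteness, a minimal sketch of that lseek/touch approach (not the
actual sm code; the function name and error handling are illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a backing file of `size' bytes and map it shared.  Everything
 * before the last byte is a hole: reads return zeros, but the kernel may
 * not reserve blocks for those pages up front, so a later store can still
 * fail (ENOSPC, SIGBUS) when the page has to be materialized. */
static void *create_shared_backing(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        return NULL;
    }
    /* "touch" the last byte so the file is logically `size' bytes long */
    if (lseek(fd, (off_t)(size - 1), SEEK_SET) < 0 || write(fd, "", 1) != 1) {
        close(fd);
        return NULL;
    }
    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return (MAP_FAILED == base) ? NULL : base;
}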


Iain



Re: [OMPI devel] SM init failures

2009-04-01 Thread Iain Bason


On Mar 31, 2009, at 11:00 AM, Jeff Squyres wrote:


On Mar 31, 2009, at 3:45 AM, Sylvain Jeaugey wrote:


Sorry to continue off-topic but going to System V shm would be for me
like going back in the past.

System V shared memory used to be the main way to do shared memory on
MPICH and from my (little) experience, this was truly painful:
 - Cleanup issues: does shmctl(IPC_RMID) solve _all_ cases? (even kill -9?)
 - Naming issues: shm segments are identified by a 32-bit key, potentially
   causing conflicts between applications or layers of the same application
   on one node
 - Space issues: the total shm size on a system is bounded by
   /proc/sys/kernel/shmmax, needing admin configuration and causing conflicts
   between MPI applications running on the same node



Indeed.  The one saving grace here is that the cleanup issues  
apparently can be solved on Linux with a special flag that indicates  
"automatically remove this shmem when all processes attaching to it  
have died."  That was really the impetus for [re-]investigating sysv  
shm.  I, too, remember the sysv pain because we used it in LAM, too...
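
For reference, my understanding of the Linux trick being alluded to is
that one simply marks the segment for removal right after attaching it.
A sketch, with illustrative names (the attach-after-removal-mark behavior
is Linux-specific, which is presumably why this wasn't attractive before):

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/* Create a segment that the kernel destroys once the last attachment goes
 * away -- including when the attached processes are killed with -9. */
static void *create_self_cleaning_segment(size_t size, int *shmid_out)
{
    int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (shmid < 0) {
        return NULL;
    }
    void *base = shmat(shmid, NULL, 0);
    if ((void *) -1 == base) {
        shmctl(shmid, IPC_RMID, NULL);
        return NULL;
    }
    /* Mark for destruction now; removal actually happens at last detach.
     * On Linux other processes can still shmat() by id after this point;
     * most other systems refuse, which is the portability catch. */
    shmctl(shmid, IPC_RMID, NULL);
    *shmid_out = shmid;
    return base;
}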


What about the other issues?  I remember those being a PITA about 15  
to 20 years ago, but obviously a lot could have improved in the  
meantime.


Iain



Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r20926

2009-04-01 Thread Iain Bason


On Apr 1, 2009, at 4:29 PM, Jeff Squyres wrote:

Should the same fixes be applied to type_create_keyval_f.c and  
win_create_keyval_f.c?


Good question.  I'll have a look at them.

Iain



Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r20926

2009-04-02 Thread Iain Bason


On Apr 1, 2009, at 4:58 PM, Iain Bason wrote:



On Apr 1, 2009, at 4:29 PM, Jeff Squyres wrote:

Should the same fixes be applied to type_create_keyval_f.c and  
win_create_keyval_f.c?


Good question.  I'll have a look at them.


It looks as though those have the same problem.  I will write a test  
to make sure.


Iain



Re: [OMPI devel] opal / fortran / Flogical

2009-06-03 Thread Iain Bason


On Jun 2, 2009, at 10:24 AM, Rainer Keller wrote:

no, that's not an issue. The comment is correct: for any Fortran
integer*kind we need to have _some_ C representation as well, otherwise we
disregard the type (tm); see e.g. the old and resolved ticket #1094.
The representation chosen is set in opal/util/arch.c and is conclusive as
far as I can tell...


Doesn't that mean that the comment is misleading?  I interpret it as  
saying that a Fortran "default integer" is always the same as a C  
"int".  I believe that you are saying that it really means that *any*  
kind of Fortran integer must be the same as one of C's integral types,  
or OpenMPI won't support it at all.  Shouldn't the comment be clearer?


Iain



Re: [OMPI devel] opal / fortran / Flogical

2009-06-03 Thread Iain Bason


On Jun 3, 2009, at 1:30 PM, Ralph Castain wrote:


I'm not entirely sure what comment is being discussed.


Jeff said:


I see the following comment:

** The fortran integer is dismissed here, since there is no
** platform known to me, were fortran and C-integer do not match

You can tell the intel compiler (and maybe others?) to compile  
fortran with double-sized integers and reals.  Are we disregarding  
this?  I.e., does this make this portion of the datatype  
heterogeneity detection incorrect?


Rainer said:

no, that's not an issue. The comment is correct: for any Fortran
integer*kind we need to have _some_ C representation as well, otherwise we
disregard the type (tm); see e.g. the old and resolved ticket #1094.


I said:

Doesn't that mean that the comment is misleading?  I interpret it as  
saying that a Fortran "default integer" is always the same as a C  
"int".  I believe that you are saying that it really means that  
*any* kind of Fortran integer must be the same as one of C's  
integral types, or OpenMPI won't support it at all.  Shouldn't the  
comment be clearer?


I believe that you are talking about a different comment:


* RHC: technically, use of the ompi_ prefix is
* an abstraction violation. However, this is actually
* an error in our configure scripts that transcends
* all the data types and eventually should be fixed.
* The guilty part is f77_check.m4. Fixing it right
* now is beyond a reasonable scope - this comment is
* placed here to explain the abstraction break and
* indicate that it will eventually be fixed


I don't know whether anyone is using either of these comments to  
justify anything.


Iain



Re: [OMPI devel] MPI_REAL16

2009-06-22 Thread Iain Bason

Jeff Squyres wrote:

Thanks for looking into this, David.

So if I understand that correctly, it means you have to assign all 
literals in your fortran program with a "_16" suffix. I don't know if 
that's standard Fortran or not. 


Yes, it is.

Iain



Re: [OMPI devel] [OMPI svn] svn:open-mpi r21480

2009-06-22 Thread Iain Bason

Ralph Castain wrote:

I'm sorry, but this change is incorrect.

If you look in orte/mca/ess/base/ess_base_std_orted.c, you will see 
that -all- orteds, regardless of how they are launched, open and 
select the PLM.


I believe you are mistaken.  Look in plm_base_launch_support.c:

    /* The daemon will attempt to open the PLM on the remote
     * end. Only a few environments allow this, so the daemon
     * only opens the PLM -if- it is specifically told to do
     * so by giving it a specific PLM module. To ensure we avoid
     * confusion, do not include any directives here
     */
    if (0 == strcmp(orted_cmd_line[i+1], "plm")) {
        continue;
    }

That code strips out anything like "-mca plm rsh" from the command
line passed to a remote daemon.

Meanwhile, over in ess_base_std_orted.c:

    /* some environments allow remote launches - e.g., ssh - so
     * open the PLM and select something -only- if we are given
     * a specific module to use
     */
    mca_base_param_reg_string_name("plm", NULL,
                                   "Which plm component to use (empty = none)",
                                   false, false,
                                   NULL, &plm_to_use);

    if (NULL == plm_to_use) {
        plm_in_use = false;
    } else {
        plm_in_use = true;

        if (ORTE_SUCCESS != (ret = orte_plm_base_open())) {
            ORTE_ERROR_LOG(ret);
            error = "orte_plm_base_open";
            goto error;
        }

        if (ORTE_SUCCESS != (ret = orte_plm_base_select())) {
            ORTE_ERROR_LOG(ret);
            error = "orte_plm_base_select";
            goto error;
        }
    }

So a PLM is loaded only if specified with "-mca plm foo", but that -mca
flag is stripped out when launching the remote daemon.

I also ran into this issue with tree spawning.  (I didn't putback a fix
because I couldn't get tree spawning actually to improve performance.  My
fix was not to strip out the "-mca plm foo" parameters if tree spawning
had been requested.)
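
Roughly speaking, the change amounts to one more condition on the strip
shown above -- a sketch only, with tree_spawn_requested standing in for
whatever flag the rsh component actually records:

    /* keep the plm directive on the remote command line when a tree
     * spawn was requested, so the remote orteds open a PLM */
    if (0 == strcmp(orted_cmd_line[i+1], "plm") && !tree_spawn_requested) {
        continue;
    }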

Iain



Re: [OMPI devel] [OMPI svn] svn:open-mpi r21480

2009-06-22 Thread Iain Bason

Ralph Castain wrote:

Yes, but look at orte/mca/plm/rsh/plm_rsh_module.c:

   
    /* ensure that only the ssh plm is selected on the remote daemon */
    var = mca_base_param_environ_variable("plm", NULL, NULL);
    opal_setenv(var, "rsh", true, &env);
    free(var);
This is done in "ssh_child", right before we fork_exec the ssh command 
to launch the remote daemon. This is why slave spawn works, for example.


My ssh does not preserve environment variables:

bash-3.2$ export MY_VERY_OWN_ENVIRONMENT_VARIABLE=yes
bash-3.2$ ssh cubbie env | grep MY_VERY_OWN
WARNING: This is a restricted access server. If you do not have explicit 
permission to access this server, please disconnect immediately. 
Unauthorized access to this system is considered gross misconduct and 
may result in disciplinary action, including revocation of SWAN access 
privileges, immediate termination of employment, and/or prosecution to 
the fullest extent of the law.

bash-3.2$

The rsh man page explicitly states that the local environment is not 
passed to the remote shell.


I haven't checked qrsh.  Maybe it works with that.

I agree that tree_spawn doesn't seem to work right now, but it is not 
due to the plm not being selected. 


It was for me.  I don't know whether it is because your rsh/ssh work 
differently, or for some other reason, but there is no question that my 
tree spawn failed because no PLM was loaded.



There are other factors involved.


The other factors that I came across were:

   * I didn't have my .ssh/config file set up to forward
 authentication.  I added a -A flag to the ssh command in
 plm_base_rsh_support.

   * In plm_rsh_module.c:setup_launch, a NULL orted_cmd made asprintf
 crash.  I used (orted_cmd == NULL ? "" : orted_cmd) in the call to
 asprintf.


Once I fixed those, tree spawning worked for me.  (I believe you 
mentioned a race condition in another conversation.  I haven't run into 
that.)


Iain



Re: [OMPI devel] MPI_REAL16

2009-06-22 Thread Iain Bason
(Thanks, Nick, for explaining that kind values are compiler-dependent. I 
was too lazy to do that.)


Jeff Squyres wrote:
Given that I'll inevitably get the language wrong, can someone suggest 
proper verbiage for this statement in the OMPI README:


- MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a
portable C datatype can be found that matches the Fortran type
REAL*16, both in size and bit representation. The Intel v11
compiler, for example, supports these types, but requires the use of
the "_16" suffix in Fortran when assigning constants to REAL*16
variables. 


The _16 suffix really has nothing to do with whether there is a C 
datatype that corresponds to REAL*16. There are two separate issues here:


  1. In Fortran code, any floating point literal has the default kind
 unless otherwise specified. That means that you can get surprising
 results from a simple program designed to test whether a C
 compiler has a data type that corresponds to REAL*16: the least
 significant bits of a REAL*16 variable will be set to zero when
 the literal is assigned to it.
  2. Open MPI requires the C compiler to have a data type that has the
 same bit representation as the Fortran compiler's REAL*16. If the
 C compiler does not have such a data type, then Open MPI cannot
 support REAL*16 in its Fortran interface.

My understanding is that the Intel representative said that there is 
some compiler switch that allows the C compiler to have such a data 
type. I didn't pay enough attention to see whether there was some reason 
not to use the switch.


She also pointed out a bug in the Fortran test code that checks for the 
presence of the C data type. She suggested using a _16 suffix on a 
literal in that test code. Nick pointed out that that _16 suffix means, 
"make this literal a KIND=16 literal," which may mean different things 
to different compilers. In particular, REAL*16 may not be the same as 
REAL(KIND=16).


However, there is no standard way to specify, "make this literal a 
REAL*16 literal." That means that you have to do one of:


   * Declare the variable REAL(KIND=16) and use the _16 suffix on the
 literal.
   * Define some parameter QUAD using the SELECTED_REAL_KIND intrinsic,
 declare the variable REAL(KIND=QUAD), and use the _QUAD suffix on
 the literal.
   * Assume that REAL*16 is the same as REAL(KIND=16) and use the _16
 suffix on the literal.

That assumption turns out to be safer than one might imagine. It is 
certainly true for the Sun and Intel compilers. I am pretty sure it is 
true for the PGI, Pathscale, and GNU compilers. I am not aware of any 
compilers for which it is not true, but that doesn't mean there is no 
such compiler.


All of which is a long-winded way of saying that maybe the README ought
to just say:


   MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a
   portable C datatype can be found that matches the Fortran type
   REAL*16, both in size and bit representation.


Iain



Re: [OMPI devel] [OMPI svn] svn:open-mpi r21504

2009-06-25 Thread Iain Bason


On Jun 23, 2009, at 7:17 PM, Ralph Castain wrote:

Not any more, when using regex - the only message that comes back is
one per node telling the HNP that the procs have been launched.  These
messages flow along the route, not directly to the HNP - assuming you
use the static port option.


Is there any prospect of doing the same without requiring the static  
port option?


Iain



Re: [OMPI devel] [OMPI svn] svn:open-mpi r21504

2009-06-25 Thread Iain Bason


On Jun 25, 2009, at 11:10 AM, Ralph Castain wrote:

They do flow along the route at all times. However, without static  
ports the orted has to start by directly connecting to the HNP and  
sending the orted's contact info to the HNP.


This is the part I don't understand.  Why can't they send the contact  
info along the route as well?  Don't they have enough information to  
wire a route to the HNP?  If not, can't they be given it at startup?


Then the HNP includes that info in the launch msg, allowing the  
orteds to wireup their routes.




So the difference is that the static ports allow us to avoid that  
initial HNP-direct connection, which is what causes the flood.


I should warn everyone that in my experiments the HNP flood is not the  
only problem with tree spawning.  In fact, it doesn't even seem to be  
the limiting problem.  At the moment, it appears that the limiting  
problem on my cluster has to do with sshd/rshd accessing some name  
service (e.g., gethostbyname, getpwnam, getdefaultproject, or  
something like that).


I am hoping to find that this is just some cluster configuration  
oddity.  YMMV, of course.


The other thing that hasn't been done yet is to have the "procs-launched"
messages roll up in the collective - the HNP gets one per daemon right now,
even though it comes down the routed path. Hope to have that done next
week. That will be in operation regardless of static vs non-static ports.


Great!

Iain




Re: [OMPI devel] [OMPI svn] svn:open-mpi r21723

2009-07-21 Thread Iain Bason


On Jul 21, 2009, at 6:31 PM, Ralph Castain wrote:

This commit appears to have broken the build system for Mac OSX.  
Could you please fix it - since it only supports Solaris and Linux,  
how about setting it so it continues to work in other environments??


That was the intent of the configure.m4 script in that directory.  It  
is supposed to check for the existence of some files in /proc, which  
should not exist on a Mac.  Could you send me the relevant portion of  
the config.log on Mac OSX?


Iain



Re: [OMPI devel] autodetect broken

2009-07-21 Thread Iain Bason


On Jul 21, 2009, at 6:34 PM, Jeff Squyres wrote:

I'm quite confused about what this component did to the base  
functions.  I haven't had a chance to digest it properly, but it  
"feels wrong"...  Iain -- can you please explain the workings of  
this component and its interactions with the base?


The autodetect component gets loaded after the environment component,  
and before the config component.  So environment variables like  
OPAL_PREFIX will override it.


When it loads, it finds the directory containing libopen-pal.so  
(assuming that is where the autodetect component actually is) and sets  
its install_dirs_data.libdir to that.  The other fields of  
install_dirs_data are set to "${infer-libdir}".  So when the base  
component loads autodetect, and no environment variables have set any  
of the fields, opal_install_dirs.everything_except_libdir is set to
"${infer-libdir}".


(If the autodetect component is statically linked into an application,  
then it will set bindir rather than libdir.)


The base component looks for fields set to "${infer-foo}", and calls  
opal_install_dirs_infer to figure out what the field should be.  For  
example, if opal_install_dirs.prefix is set to "${infer-libdir}", then  
it calls opal_install_dirs_infer("prefix", "libdir}", 6,
&component->install_dirs_data).


opal_install_dirs_infer expands everything in
component->install_dirs_data.libdir *except* "${prefix}".  Let's say that
ompi was configured so that libdir is "${prefix}/lib", and the actual path
to libopen-pal.so is /usr/local/lib/libopen-pal.so.  The autodetect
component will have set opal_install_dirs.libdir to "/usr/local/lib".
It matches the tail of "${prefix}/lib" to "/usr/local/lib", and infers
that the remainder must be the prefix, so it sets
opal_install_dirs.prefix to "/usr/local".
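
In other words, the inference is just a tail match plus a substitution,
something like this sketch (not the actual opal_install_dirs_infer code;
the name and error handling are illustrative):

#include <stdlib.h>
#include <string.h>

/* Given the configured pattern for a directory (e.g. "${prefix}/lib") and
 * the detected path (e.g. "/usr/local/lib"), strip the common tail and
 * return what is left as the prefix, or NULL if no inference is possible. */
static char *infer_prefix(const char *pattern, const char *detected)
{
    const char *tail = strstr(pattern, "${prefix}");
    if (tail != pattern) {
        return NULL;                  /* pattern does not start with ${prefix} */
    }
    tail += strlen("${prefix}");      /* e.g. "/lib" */
    size_t tlen = strlen(tail);
    size_t dlen = strlen(detected);
    if (dlen < tlen || 0 != strcmp(detected + dlen - tlen, tail)) {
        return NULL;                  /* detected path does not end in that tail */
    }
    return strndup(detected, dlen - tlen);   /* e.g. "/usr/local" */
}

So infer_prefix("${prefix}/lib", "/usr/local/lib") yields "/usr/local",
and it declines to guess when the detected path does not end in the
configured tail.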


Other directories (e.g., pkgdatadir) presumably cannot be inferred  
from libdir, and opal_install_dirs_infer will return NULL.  The config  
component will then load some value into that field, and things will  
work as they did before.


Iain



Re: [OMPI devel] autodetect broken

2009-07-21 Thread Iain Bason


On Jul 21, 2009, at 6:34 PM, Jeff Squyres wrote:


Also, it seems broken:

[15:31] svbu-mpi:~/svn/ompi4 % ompi_info | grep installd
--
Sorry!  You were supposed to get help about:
    developer warning: field too long
But I couldn't open the help file:
    /${datadir}/openmpi/help-ompi_info.txt: No such file or directory.  Sorry!
--
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4)
MCA installdirs: autodetect (MCA v2.0, API v2.0, Component v1.4)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4)
[15:31] svbu-mpi:~/svn/ompi4 %

The help file should have been found.  This is on Linux RHEL4, but I  
doubt it's a Linux-version-specific issue...


Could you send me your configure options, and your OPAL_XXX  
environment variables?


Iain



Re: [OMPI devel] [OMPI svn] svn:open-mpi r21723

2009-07-21 Thread Iain Bason


On Jul 21, 2009, at 7:35 PM, Jeff Squyres wrote:

However, I see that autodetect configure.m4 is checking  
$backtrace_installdirs_happy -- which seems like a no-no.  The  
ordering of framework / component configure.m4 scripts is not  
guaranteed, so it's not a good idea to check the output of another  
configure.m4's macro.


Grrr, I thought I had changed all those to findpc_happy.  Well, that's  
easy enough to fix.  I don't see how it could result in the component  
being built when it isn't supposed to be, though.


Iain



Re: [OMPI devel] autodetect broken

2009-07-21 Thread Iain Bason


On Jul 21, 2009, at 7:50 PM, Jeff Squyres wrote:

If you have an immediate fix for this, that would be great --  
otherwise, please back this commit out (I said in my previous mail  
that I would back it out, but I had assumed that you were gone for  
the day.  If you're around, please make the call...).


I am effectively gone for the day.  (I am managing to send the odd  
email between my kids interrupting me.)  Please do back out.  I'll be  
able to look at fixing it tomorrow.


Iain



Re: [OMPI devel] autodetect broken

2009-07-21 Thread Iain Bason


On Jul 21, 2009, at 7:50 PM, Jeff Squyres wrote:


On Jul 21, 2009, at 7:46 PM, Iain Bason wrote:

> The help file should have been found.  This is on Linux RHEL4, but I
> doubt it's a Linux-version-specific issue...

Could you send me your configure options, and your OPAL_XXX
environment variables?




 $ ./configure --prefix=/home/jsquyres/bogus --disable-mpi-f77 --enable-mpirun-prefix-by-default


No OPAL_* env variables set.

Same thing happens on OS X and Linux.


And does it fail when actually installed in /home/jsquyres/bogus, or  
only when installed elsewhere?


Iain



Re: [OMPI devel] autodetect broken

2009-07-21 Thread Iain Bason


On Jul 21, 2009, at 7:48 PM, Jeff Squyres wrote:

Arrgh!!  Even with .ompi_ignore, everything is broken on OS X and  
Linux (perhaps this is what Ralph was referring to -- not a compile  
time problem?):


-
$ mpicc -g -Isrc   -c -o libmpitest.o libmpitest.c
Cannot open configuration file ${datadir}/openmpi/mpicc-wrapper-data.txt
Error parsing data file mpicc: Not found
-


Is this just mpicc, or is it also ompi_info and mpirun failing like  
this?  I presume the autodetect component is *not* involved, right? So  
this presumably is a problem with opal_install_dirs_expand?


Iain



Re: [OMPI devel] autodetect broken

2009-07-22 Thread Iain Bason


On Jul 22, 2009, at 10:55 AM, Brian W. Barrett wrote:

The current autodetect implementation seems like the wrong approach  
to me. I'm rather unhappy the base functionality was hacked up like  
it was without any advanced notice or questions about original  
design intent. We seem to have a set of base functions which are now  
more unreadable than before, overly complex, and which leak memory.


First, I should apologize for my procedural misstep.  I had asked my  
group here at Sun whether I should do an RFC or something, and I guess  
I didn't make my question clear enough.  I was under the impression  
that I should check something in and let people comment on it.


That being said, I would argue that there are good reasons for adding  
some complexity to the base component:


1. The pre-existing implementation of expansion is broken (although  
the cases for which it is broken are somewhat obscure).


2. The autodetect component cannot set more than one directory without  
some knowledge of the relationships amongst the various directories.   
Giving it that knowledge would violate the independence of the  
components.


You can see #1 by doing "OPAL_PREFIX='${datarootdir}/..'  
OPAL_DATAROOTDIR='/opt/SUNWhpc/HPC8.2/share' mpirun hostname" (if you  
have installed in /opt/SUNWhpc/HPC8.2).  Yes, I know, "Why would  
anyone do that?"  Nonetheless, I find that a poor excuse for having a  
bug in the code.


To expand on #2 a little, consider the case where OpenMPI is  
configured with "--prefix=/path/one --libdir=/path/two".  We can tell  
that libopen-pal is in /path/two, but it is not correct to assume that  
the prefix is therefore /path.  (Unfortunately, there is code in  
OpenMPI that does make that sort of assumption -- see orterun.c.)  We  
need information from the config component to avoid making incorrect  
assumptions.


There are, of course, alternate ways of getting to the same point.   
But it is not feasible simply to leave the design of the base  
component unchanged.  (More on that below.)


As for readability, I am always open to constructive suggestions as to  
how to make code more readable.  I didn't fix the memory leak because  
I hadn't yet found a way to do that without decreasing readability.


The intent of the installdirs framework was to allow this type of  
behavior, but without rehacking all this infer crap into base.  The  
autodetect component should just set $prefix in the set of functions  
it returns (and possibly libdir and bindir if you really want, which  
might actually make sense if you guess wrong), and let the expansion  
code take over from there.  The general thought on how this would  
work went something like:


- Run after config
- If you determine you have a special $prefix, set the
  opal_install_dirs.prefix to NULL (yes, it's a bit of a hack) and
  set your special prefix.
- Same with bindir and libdir if needed
- Let expansion (which runs after all components have had the
  chance to fill in their fields) expand out with your special
  data


If we run the autodetect component after config, and allow it to  
override values that are already in opal_install_dirs, then there will  
be no way for users to have environment variables take precedence.   
(Let's say someone runs with OPAL_LIBDIR=/foo.  The autodetect  
component will not know whether opal_install_dirs.libdir has been set  
by the env component or by the config component.)


Moreover, if the user has used an environment variable to override one  
of the paths in the config component, then the autodetect component  
may make the wrong inference.  For example, let's say someone runs  
with OPAL_LIBDIR=/foo.  The autodetect component finds libopen-pal in
/usr/renamed/lib, and sets opal_install_dirs.libdir to /usr/renamed/lib.
However, it has to use the config component's idea of libdir
(e.g., ${exec_prefix}/lib) to correctly infer that prefix should be
/usr/renamed.  Since it only has /foo from the environment variable, it
will have to decide that it cannot infer the prefix.


All of this will lead to behavior that users will have trouble  
diagnosing.  While I appreciate simple code, I think that a simple  
user interface is more important.


We could add some infrastructure so that the autodetect component can  
figure out the provenance of each field in opal_install_dirs, but that  
would make the boundary between the base component and the autodetect  
component unclear.


And the base stays simple, the components do all the heavy lifting,  
and life is happy.


Except in the cases where it doesn't work.

 I would not be opposed to putting in a "find expanded part" type
function that takes two strings like "${prefix}/lib" and "/usr/local/lib"
and returns "/usr/local" being added to the base so that other
autodetect-style components don't need to handle such a case, but
that's about the extent of the base changes I think are appropriate.


Finally, a first quick code review rev

Re: [OMPI devel] Shared library versioning

2009-07-25 Thread Iain Bason


On Jul 23, 2009, at 5:53 PM, Jeff Squyres wrote:

We have talked many times about doing proper versioning for  
OMPI's .so libraries (e.g., libmpi.so -- *not* our component DSOs).


Forgive me if this has been hashed out, but won't you run into trouble  
by not versioning the components?  What happens when there are  
multiple versions of libmpi installed?  The user program will pick up  
the correct one because of versioning, but how will libmpi pick up the  
correct versions of the components?


Iain



[OMPI devel] RFC: Suspend/resume enhancements

2010-01-04 Thread Iain Bason

WHAT: Enhance the orte_forward_job_control MCA flag by:

  1. Forwarding signals to descendants of launched processes; and
  2. Forwarding signals received before process launch time.

(The orte_forward_job_control flag arranges for SIGTSTP and SIGCONT to
be forwarded.  This allows a resource manager like Sun Grid Engine to
suspend a job by sending a SIGTSTP signal to mpirun.)

WHY: Some programs do "mpirun prog.sh", and prog.sh starts multiple
 processes.  Among these programs is weather prediction code from
 the UK Met Office.  This code is used at multiple sites around
 the world.  Since other MPI implementations* forward job control
 signals this way, we risk having OMPI excluded unless we
 implement this feature.

 [*I have personally verified that Intel MPI does it.  I have
 heard that Scali does it.  I don't know about the others.]

HOW: To allow signals to be sent to descendants of launched processes,
 use the setpgrp() system call to create a new process group for
 each launched process.  Then send the signal to the process group
 rather than to the process.

 To allow signals received before process launch time to be
 delivered when the processes are launched, add a job state flag
 to indicate whether the job is suspended.  Check this flag at
 launch time, and send a signal immediately after launching.
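
A bare-bones sketch of those two pieces, just to illustrate the mechanism
(function names are made up; this is not the actual odls code):

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* child side, between fork() and exec(): start a new process group with
 * the launched process as its leader (the setpgrp() of the text) */
static void child_setup(void)
{
    setpgid(0, 0);
}

/* parent side, when a SIGTSTP/SIGCONT to be forwarded arrives: a negative
 * pid signals every member of that process group, including any
 * grandchildren the launched script started */
static int forward_job_control(pid_t child_pid, int sig)
{
    return kill(-child_pid, sig);
}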

WHERE: http://bitbucket.org/igb/ompi-job-control/

WHEN: We would like to integrate this into the 1.5 branch.

TIMEOUT: COB Tuesday, January 19, 2010.

Q&A:

  1. Will this work for Windows?

 I don't know what would be required to make this work for
 Windows.  The current implementation is for Unix only.

  2. Will this work for interactive ssh/rsh PLM?

 It will not work any better or worse than the current
 implementation.  One can suspend a job by typing Ctl-Z at a
 terminal, but the mpirun process itself never gets suspended.
 That means that in order to wake the job up one has to open a
 different terminal to send a SIGCONT to the mpirun process.  It
 would be desirable to fix this problem, but as this feature is
 intended for use with resource managers like SGE it isn't
 essential to make it work smoothly in an interactive shell.

  3. Will the creation of new process groups prohibit SGE from killing
 a job properly?

 No.  SGE has a mechanism to ensure that all a job's processes are
 killed, regardless of whether they create new process groups.

  4. What about other resource managers?

 Using this flag with another resource manager might cause
 problems.  However, the flag may not be necessary with other
 resource managers.  (If the RM can send SIGSTOP to all the
 processes on all the nodes running a job, then mpirun doesn't
 need to forward job control signals.)

 According to the SLURM documentation, plugins are available
 (e.g., linuxproc) that would allow reliable termination of all a
 job's processes, regardless of whether they create new process
 groups.
 [https://computing.llnl.gov/linux/slurm/proctrack_plugins.html]

  5. Will the creation of new process groups prevent mpirun from
 shutting down the job successfully (e.g., when it receives a
 SIGTERM)?

 No.  I have tested jobs both with and without calls to
 MPI_Comm_Spawn, and all are properly terminated.

  6. Can we avoid creating new process groups by just signaling the
 launched process plus any process that calls MPI_Init?

 No.  The shell script might launch other background processes
 that the user wants to suspend.  (The Met Office code does this.)

  7. Can we avoid creating new process groups by having mpirun and
 orted send SIGTSTP to their own process groups, and ignore the
 signal that they send to themselves?

 No.  First, mpirun might be in the same process group as other
 mpirun processes.  Those mpiruns could get into an infinite loop
 forwarding SIGTSTPs to one another.  Second, although the default
 action on receipt of SIGTSTP is to suspend the process, that only
 happens if the process is not in an orphaned process group.  SGE
 starts processes in orphaned process groups.



Re: [OMPI devel] RFC: Suspend/resume enhancements

2010-01-27 Thread Iain Bason
Having heard no further comments, I plan to integrate this into the  
trunk on Monday.


Iain

On Jan 5, 2010, at 6:27 AM, Terry Dontje wrote:

This only happens when the orte_forward_job_control MCA flag is set  
to 1 and the default is that it is set to 0.  Which I believe meets  
Ralph's criteria below.


--td

Ralph Castain wrote:
I don't have any issue with this so long as (a) it is -only- active  
when someone sets a specific MCA param requesting it, and (b) that  
flag is -not- set by default.



On Jan 4, 2010, at 11:50 AM, Iain Bason wrote:



WHAT: Enhance the orte_forward_job_control MCA flag by:

1. Forwarding signals to descendants of launched processes; and
2. Forwarding signals received before process launch time.

(The orte_forward_job_control flag arranges for SIGTSTP and SIGCONT to
be forwarded.  This allows a resource manager like Sun Grid Engine to
suspend a job by sending a SIGTSTP signal to mpirun.)

WHY: Some programs do "mpirun prog.sh", and prog.sh starts multiple
   processes.  Among these programs is weather prediction code from
   the UK Met Office.  This code is used at multiple sites around
   the world.  Since other MPI implementations* forward job control
   signals this way, we risk having OMPI excluded unless we
   implement this feature.

   [*I have personally verified that Intel MPI does it.  I have
   heard that Scali does it.  I don't know about the others.]

HOW: To allow signals to be sent to descendants of launched processes,
   use the setpgrp() system call to create a new process group for
   each launched process.  Then send the signal to the process group
   rather than to the process.

   To allow signals received before process launch time to be
   delivered when the processes are launched, add a job state flag
   to indicate whether the job is suspended.  Check this flag at
   launch time, and send a signal immediately after launching.

WHERE: http://bitbucket.org/igb/ompi-job-control/

WHEN: We would like to integrate this into the 1.5 branch.

TIMEOUT: COB Tuesday, January 19, 2010.

Q&A:

1. Will this work for Windows?

   I don't know what would be required to make this work for
   Windows.  The current implementation is for Unix only.

2. Will this work for interactive ssh/rsh PLM?

   It will not work any better or worse than the current
   implementation.  One can suspend a job by typing Ctl-Z at a
   terminal, but the mpirun process itself never gets suspended.
   That means that in order to wake the job up one has to open a
   different terminal to send a SIGCONT to the mpirun process.  It
   would be desirable to fix this problem, but as this feature is
   intended for use with resource managers like SGE it isn't
   essential to make it work smoothly in an interactive shell.

3. Will the creation of new process groups prohibit SGE from killing
   a job properly?

   No.  SGE has a mechanism to ensure that all a job's processes are
   killed, regardless of whether they create new process groups.

4. What about other resource managers?

   Using this flag with another resource manager might cause
   problems.  However, the flag may not be necessary with other
   resource managers.  (If the RM can send SIGSTOP to all the
   processes on all the nodes running a job, then mpirun doesn't
   need to forward job control signals.)

   According to the SLURM documentation, plugins are available
   (e.g., linuxproc) that would allow reliable termination of all a
   job's processes, regardless of whether they create new process
   groups.
   [https://computing.llnl.gov/linux/slurm/proctrack_plugins.html]

5. Will the creation of new process groups prevent mpirun from
   shutting down the job successfully (e.g., when it receives a
   SIGTERM)?

   No.  I have tested jobs both with and without calls to
   MPI_Comm_Spawn, and all are properly terminated.

6. Can we avoid creating new process groups by just signaling the
   launched process plus any process that calls MPI_Init?

   No.  The shell script might launch other background processes
   that the user wants to suspend.  (The Met Office code does this.)

7. Can we avoid creating new process groups by having mpirun and
   orted send SIGTSTP to their own process groups, and ignore the
   signal that they send to themselves?

   No.  First, mpirun might be in the same process group as other
   mpirun processes.  Those mpiruns could get into an infinite loop
   forwarding SIGTSTPs to one another.  Second, although the default
   action on receipt of SIGTSTP is to suspend the process, that only
   happens if the process is not in an orphaned process group.  SGE
   starts processes in orphaned process groups.


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r22762

2010-03-03 Thread Iain Bason

On Mar 3, 2010, at 1:24 PM, Jeff Squyres wrote:

> I'm not sure I agree with change #1.  I understand in principle why the 
> change was made, but I'm uncomfortable with:
> 
> 1. The individual entries now behave like pseudo-regexp's rather than strict 
> matching.  We used strict matching before this for a reason.  If we want to 
> allow regexp-like behavior, then I think we should enable that with special 
> characters -- that's the customary/usual way to do it.

The history of this particular piece of code is that it used to use strncmp.  
George Bosilca changed it last summer, incidental to a larger change (r21652).  
The commit comment was not particularly illuminating on this issue, in my 
opinion:

http://www.open-mpi.org/hg/hgwebdir.cgi/ompi-svn-mirror/rev/bde31d3db7ba

> 2. All other _in|exclude behavior in ompi is strict matching, not prefix 
> matching.  I'm uncomfortable with the disparity.

That turns out not to be the case.  Look in 
btl_tcp_proc.c/mca_btl_tcp_retrieve_local_interfaces.

> Additionally, if loopback is now handled properly via change #2, shouldn't 
> the default value for the btl_tcp_if_exclude parameter now be empty?

That's a good question.  Enabling the "lo" interface results in intra-node 
messages being striped across that interface in addition to the others on a 
system.  I don't know what impact that would have, if any.

> Actually -- thinking about this a little more, does opal_net_islocalhost() 
> guarantee to work on peer interfaces?  

It looks to see whether the IP address is (v4) 127.0.0.1, or (v6) ::1.  I 
believe that these values are dictated by the relevant RFCs (but I haven't 
looked to make sure).

Iain



Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r22762

2010-03-03 Thread Iain Bason

On Mar 3, 2010, at 3:04 PM, Jeff Squyres wrote:

> 
> Mmmm... good point.  I was thinking specifically of the if_in|exclude 
> behavior in the openib BTL.  That uses strcmp, not strncmp.  Here's a 
> complete list:
> 
> ompi_info --param all all --parsable | grep include | grep :value:
> mca:opal:base:param:opal_event_include:value:poll
> mca:btl:ofud:param:btl_ofud_if_include:value:
> mca:btl:openib:param:btl_openib_if_include:value:
> mca:btl:openib:param:btl_openib_ipaddr_include:value:
> mca:btl:openib:param:btl_openib_cpc_include:value:
> mca:btl:sctp:param:btl_sctp_if_include:value:
> mca:btl:tcp:param:btl_tcp_if_include:value:
> mca:btl:base:param:btl_base_include:value:
> mca:oob:tcp:param:oob_tcp_if_include:value:
> 
> Do we know what these others do?  I only checked openib_if_*clude -- it's 
> strcmp.

I haven't looked at those, but it's easy to grep for strncmp...

It looks as though sctp is the only other BTL that uses strncmp.

Of course, if we decide to change the default so that it no longer includes 
"lo" then maybe using strncmp doesn't matter.  The problem has been that the 
name of the interface is different on different platforms.

(I should note that the default also excludes "sppp".  I don't know anything 
about that interface.)
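
To make the difference concrete (the helper names are made up, not the
BTL code): with strncmp the entry "lo" also covers "lo0" and friends,
while strcmp matches only the literal string.

#include <stdio.h>
#include <string.h>

static int prefix_match(const char *entry, const char *ifname)
{
    return 0 == strncmp(entry, ifname, strlen(entry));
}

static int exact_match(const char *entry, const char *ifname)
{
    return 0 == strcmp(entry, ifname);
}

int main(void)
{
    printf("%d %d\n", prefix_match("lo", "lo0"), exact_match("lo", "lo0"));
    /* prints "1 0": the prefix match accepts lo0, the exact match does not */
    return 0;
}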

>>> Additionally, if loopback is now handled properly via change #2, shouldn't 
>>> the default value for the btl_tcp_if_exclude parameter now be empty?
>> 
>> That's a good question.  Enabling the "lo" interface results in intra-node 
>> messages being striped across that interface in addition to the others on a 
>> system.  I don't know what impact that would have, if any.
> 
> sm and self should still be prioritized above it, right?  If so, we should be 
> ok.

Yes, that's true.  It would only affect those who restrict intra-node 
communication to TCP.

> However, I think you're right that the addition of striping across lo* in 
> addition to the other interfaces might have an unknown effect.
> 
> Here's a random question -- if a user does not use the sm btl, would sending 
> messages through lo for on-node communication be potentially better than 
> sending it through a real device, given that that real device may be far away 
> (in the NUMA sense of "far")?  I.e., are OS's typically smart enough to know 
> that loopback traffic may be able to stay local to the NUMA node, vs. sending 
> it out to a device and back?  Or are OS's smart enough to know that if the 
> both ends of a TCP socket are on the same node -- regardless of what IP 
> interface they use -- and if both processes are on the same NUMA locality, 
> that the data can stay local and not have to make a round trip to the device?
> 
> (I admit that this is a fairly corner case -- doing on-node communication but 
> *not* using the sm btl...)

Good question.  For the loopback interface there is no physical device, so 
there should be no NUMA effect.  For an interface with a physical device there 
may be some reason that a packet would actually have to go out to the device.  
If there is no such reason, I would expect Unix to be smart enough not to do 
it, given how much intra-node TCP traffic one commonly sees on Unix.  I 
couldn't hazard a guess about Windows.

Iain