[OMPI devel] static rate / connection modularity

2007-07-26 Thread Jeff Squyres
I'm about ready to start on the connection modularity stuff in the  
openib BTL.  A few changes are getting rolled up in this work:


1. Modularize the connection scheme in the openib BTL as per previous  
discussions (use function pointers to choose between the current OOB  
wireup and the RDMA CM -- I'll probably do just a skeleton of the  
RDMA CM at first; to be filled in later).  Preliminary prototypes of  
this work in a /tmp branch showed that it cleaned up  
btl_openib_endpoint.c a *lot*.
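
For illustration, here is a hypothetical sketch of what such a pluggable
connection scheme could look like; every name below is an illustrative
assumption, not an actual Open MPI symbol.

/* One entry per wireup scheme (OOB today, RDMA CM later). */
struct openib_endpoint;   /* stands in for the BTL's real endpoint type */

struct openib_connect_module {
    const char *name;                                  /* e.g. "oob" or "rdma_cm" */
    int (*init)(void);                                 /* one-time module setup */
    int (*start_connect)(struct openib_endpoint *ep);  /* begin wireup to a peer */
    int (*finalize)(void);                             /* tear the module down */
};

/* The BTL would pick one module at init time and call through the pointers,
 * so btl_openib_endpoint.c no longer hard-codes the OOB wireup path. */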


2. [Re-]Fix the problem with having heterogeneous numbers of ports  
across hosts (it seems to be broken again -- bonk).


3. Remove the static rate MCA parameter and instead, have the  
endpoints negotiate (either in the modex or at wireup time --  
whichever works best) to use the speed of the slower port.
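
A minimal sketch of that negotiation, assuming each side learns the peer's
active port speed through the modex (or at wireup time); the helper name and
the Mb/s units are illustrative, not existing openib BTL code.

static int negotiated_rate_mbps(int local_rate_mbps, int remote_rate_mbps)
{
    /* run at the speed of the slower port so neither side overdrives the other */
    return (local_rate_mbps < remote_rate_mbps) ? local_rate_mbps
                                                : remote_rate_mbps;
}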


--
Jeff Squyres
Cisco Systems



[OMPI devel] openib credits problem

2007-07-26 Thread Jeff Squyres
I got a problem in MTT runs last night with the openib BTL w.r.t.  
credits:


[...lots of IMB output...]
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
        0         1000       367.66       371.58       369.34         0.00
IMB-MPI1: ./btl_openib_endpoint.h:261: Assertion
`endpoint->qps[qp].u.pp_qp.rd_credits < rd_num' failed.


Gleb -- you've been mucking around in here recently; did something  
you do cause this, perchance?


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] openib credits problem

2007-07-26 Thread Gleb Natapov
On Thu, Jul 26, 2007 at 09:12:26AM -0400, Jeff Squyres wrote:
> I got a problem in MTT runs last night with the openib BTL w.r.t.  
> credits:
> 
> [...lots of IMB output...]
> #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
>      0         1000       367.66       371.58       369.34         0.00
> IMB-MPI1: ./btl_openib_endpoint.h:261: Assertion
> `endpoint->qps[qp].u.pp_qp.rd_credits < rd_num' failed.
> 
> Gleb -- you've been mucking around in here recently; did something  
> you do cause this, perchance?
> 
This is definitely caused by something I did. I am not sure this assert
is valid though. I am looking into it.

--
Gleb.


Re: [OMPI devel] Hostfiles - yet again

2007-07-26 Thread Rolf . Vandevaart

Aurelien Bouteiller wrote:


Hi Ralph and everyone,

I just want to make sure the proposed use cases do not break one of the
current Open MPI features I require. For FT purposes, I need to get some
specific hosts (let's say with a better MTBF). Those hosts are not part
of the MPI_COMM_WORLD but are used to deploy FT services (like event
loggers, checkpoint servers, etc.). To enable collaboration between
computing nodes and those FT services, I use the usual MPI-2 dynamics
with MPI_Comm_accept/connect. This means that those different instances of
mpirun need to share the same ORTE registry, so that they can establish
the MPI-2 connect/accept through the registered MPI ports.
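
For readers less familiar with the MPI-2 dynamics being described, a minimal
sketch of the accept/connect pattern follows; error handling is omitted, and
the port name would be exchanged out of band or via MPI_Publish_name.

#include <mpi.h>
#include <stdio.h>

/* FT service side: runs in its own mpirun/COMM_WORLD and accepts clients. */
void ft_service(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Open_port(MPI_INFO_NULL, port);       /* obtain a port name */
    printf("service port: %s\n", port);       /* publish it out of band */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    /* ... exchange event-log / checkpoint traffic over 'client' ... */
    MPI_Comm_disconnect(&client);
    MPI_Close_port(port);
}

/* Compute side: a separate mpirun in the same universe connects to it. */
void compute_node(char *port)
{
    MPI_Comm server;

    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    /* ... send FT data ... */
    MPI_Comm_disconnect(&server);
}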


With this background in place, my first concern is how the deployment maps to
the allocated resources. The nodes used to deploy FT services are
"special". In the typical use case, I get machines with a better MTBF, or
faster or larger disks, by requesting special properties from the resource
allocation manager. I don't want those to be mixed with regular nodes in
the resulting hostfile: these scarce resources should hold only FT
services, no computing processes. As I understand things, I don't see
any way to prevent mpirun from deploying application processes on my
"special" nodes if they are part of the same launch/allocation in your
"filtering" use case. Currently I run two different mpiruns with a single
ORTE seed holding the registry. This way I get two different hostfiles, one
for computing nodes, one for FT services. I just want to make sure
everybody understood this requirement so that this feature does not
disappear in the brainstorming :]
 


With the use of resource managers, --host, and --hostfile this should
all be possible.

The next requirement is the ability to add some nodes to the initial pool at
runtime. Because nodes may fail (but it is basically the same with comm_spawn),
I might need some (possibly many) spare nodes to replace failed
ones. As I do not want to request twice as many nodes as I need
(after all, things could just go fine, so why should I waste that many
computing resources on idle spares?), I definitely want to be able to
allocate some new nodes into the pool of already running machines. As
far as I understand, this is impossible to achieve with use case 2 and
quite difficult with use case 1. In my opinion, having the ability to spawn
on nodes which are not part of the initial hostfile is a key feature
(and not only for FT purposes).


 


I am looking for more detail on the above issue.  What
resource manager are you using?

Ideally, we would prefer not to support this.  Any nodes
that you run on, or hope to run on, would be designated
at the start.   For example:

mpirun -np 1 --host a,b,c,d,e,f,g

This would cause the one process of the MPI job to start on host a.
Then, the MPI job has the other hosts available to it should it decide
later to start a job on them.  However, no ORTE daemons would
be started on those nodes until calls to MPI_Comm_spawn
occur.  So, the MPI job would not be consuming any resources
until called upon to do so.
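
As a concrete illustration of that pattern, here is a sketch of spawning a
child onto one of the pre-listed hosts. Treat the "host" MPI_Info key as an
assumption about how placement is requested, not a guaranteed interface;
check the MPI_Comm_spawn documentation for the supported keys.

#include <mpi.h>

/* Spawn one "worker" process onto host b from the pool listed on the
 * mpirun command line above.  Error handling omitted. */
void spawn_worker_on_b(void)
{
    MPI_Comm child;
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "b");          /* assumed placement key */

    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    /* ... communicate over the 'child' intercommunicator, then ... */
    MPI_Comm_disconnect(&child);
}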

Rolf

I know there have been some extra discussions on this subject. 
Unfortunately it looks like I am not part of the list where it happened. 
I hope those concerns have not been already discussed.


Aurelien

Ralph H Castain wrote:
 


Yo all

As you know, I am working on revamping the hostfile functionality to make it
work better with managed environments (at the moment, the two are
exclusive). The issue that we need to review is how we want the interaction
to work, both for the initial launch and for comm_spawn.

In talking with Jeff, we boiled it down to two options that I have
flow-charted (see attached):

Option 1: in this mode, we read any allocated nodes provided by a resource
manager (e.g., SLURM). These nodes establish a base pool of nodes that can
be used by both the initial launch and any dynamic comm_spawn requests. The
hostfile and any -host info is then used to select nodes from within that
pool for use with the specific launch. The initial launch would use the
-hostfile or -host command line option to provide that info - comm_spawn
would use the MPI_Info fields to provide similar info.

This mode has the advantage of allowing a user to obtain a large allocation,
and then designate hosts within the pool for use by an initial application,
and separately designate (via another hostfile or -host spec) another set of
those hosts from the pool to support a comm_spawn'd child job.

If no resource managed nodes are found, then the hostfile and -host options
would provide the list of hosts to be used. Again, comm_spawn'd jobs would
be able to specify their own hostfile and -host nodes.

The negative to this option is complexity - in the absence of a managed
allocation, I either have to deal with hostfile/dash-host allocations in the
RAS and then again in RMAPS, or I have "allocation-like" functionality
happening in RMAPS.


Option 2: in this mode, we read an

Re: [OMPI devel] Hostfiles - yet again

2007-07-26 Thread Ralph H Castain



On 7/26/07 7:33 AM, "rolf.vandeva...@sun.com" 
wrote:

> Aurelien Bouteiller wrote:
> 
>> Hi Ralph and everyone,
>> 
>> I just want to make sure the proposed usecases does not break one of the
>> current open MPI feature I require. For FT purposes, I need to get some
>> specific hosts (lets say with a better MTBF). Those hosts are not part
>> of the MPI_COMM_WORLD but are used to deploy FT services (like event
>> loggers, checkpoint servers, etc). To enable collaboration between
>> computing nodes and those FT services, I use the usual MPI2 Dynamics
>> with MPI_Accept/Connect. This means that those different instances of
>> mpirun needs to share the same orte registry, so that they can establish
>> the MPI2 connect/accept trough the registered MPI_ports.
>> 
>> This background in place, my first concern is how the deployment maps to
>> the allocated resources. The nodes used to deploy FT services are
>> "special". In typical usecase, I get machines with better MTFB, faster
>> or larger disks by requesting special properties to the resources
>> allocation manager. I don't want those to be mixed with regular nodes in
>> the resulting hostfile: these scarce resources should hold only FT
>> services, no computing processes. As I understand things, I don't see
>> any way to avoid mpirun to deploy application processes on my "special"
>> nodes if they are part of the same launch/allocation in your "filtering"
>> usecase. Currently I proceed to two different mpirun with a single orte
>> seed holding the registry. This way I get two different hostfiles, one
>> for computing nodes, one for FT services. I just want to make sure
>> everybody understood this requirement so that this feature does not
>> disappear in the brainstorming :]
>>  
>> 
> With the use of resource managers, --host, and --hostfile this should
> all be possible.
> 

I'll try to keep this in mind as we implement the change... will have to see
if this ability really can be supported in the revision. I'll certainly let
you know if I run into a conflict.

>> Next requirement is the ability to add during runtime some nodes to the
>> initial pool. Because node may fail (but it is the same with comm_spawn
>> basically) , I might need some (lot of) spare nodes to replace failed
>> ones. As I do not want to request for twice as many nodes as I need
>> (after all, things could just go fine, why should I waste that many
>> computing resources for idle spares ?), I definitely want to be able to
>> allocate some new nodes to the pool of the already running machines. As
>> far as I understand, this is impossible to achieve with the usecase2 and
>> quite difficult in usecase1. In my opinion, having the ability to spawn
>> on nodes which are not part of the initial hostfile is a key feature
>> (and not only for FT purposes).
>> 
>>  
>> 
> I am looking for more detail into the above issue.   What
> resource manager are you using?
> 
> Ideally, we would prefer not to support this.  Any nodes
> that you run on, or hope to run on, would be designated
> at the start.   For example:
> 
> mpirun -np 1 --host a,b,c,d,e,f,g
> 
> This would cause the one process of the mpi job to start on host a.
> Then, the mpi job has available to it the other hosts should it decide
> later to start a job on them.  However no ORTE daemons would
> be started on those nodes until calls to MPI_Comm_spawn
> occur.   So, the MPI job would not be consuming any resources
> until called upon to.

This has actually been the subject of multiple threads on the user list and
is considered a critical capability by some users and vendors. I believe
there is little problem in allowing those systems that can support it to
dynamically add nodes to ORTE via some API into the resource manager. At the
moment, none of the RMs support it, but LSF will (and TM at least may)
shortly do so, and some of their customers are depending upon it.

The problem is that job startup could be delayed for significant time if all
hosts must be preallocated. Admittedly, this raises all kinds of issues
about how long the job could be stalled waiting for the new hosts. However,
as the other somewhat exhaustive threads have discussed, there are computing
models that can live with this uncertainty, and RMs that will provide async
callbacks to allow the rest of the app to continue working while waiting.

Just my $0.02 - again, this goes back to... are there use cases and
customers to which Open MPI is simply going to say "we won't support that"?

> 
> Rolf
> 
>> I know there have been some extra discussions on this subject.
>> Unfortunately it looks like I am not part of the list where it happened.
>> I hope those concerns have not been already discussed.
>> 
>> Aurelien
>> 
>> Ralph H Castain wrote:
>>  
>> 
>>> Yo all
>>> 
>>> As you know, I am working on revamping the hostfile functionality to make it
>>> work better with managed environments (at the moment, the two are
>>> exclusive). The issue that we need to review is

Re: [OMPI devel] openib credits problem

2007-07-26 Thread Gleb Natapov
On Thu, Jul 26, 2007 at 04:29:40PM +0300, Gleb Natapov wrote:
> On Thu, Jul 26, 2007 at 09:12:26AM -0400, Jeff Squyres wrote:
> > I got a problem in MTT runs last night with the openib BTL w.r.t.  
> > credits:
> > 
> > [...lots of IMB output...]
> > #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
> >      0         1000       367.66       371.58       369.34         0.00
> > IMB-MPI1: ./btl_openib_endpoint.h:261: Assertion
> > `endpoint->qps[qp].u.pp_qp.rd_credits < rd_num' failed.
> > 
> > Gleb -- you've been mucking around in here recently; did something  
> > you do cause this, perchance?
> > 
> This is definitely caused by something I did. I am not sure this assert
> is valid though. I am looking into it.
> 
Assertion is valid. r15635 should fix this.

--
Gleb.


Re: [OMPI devel] MPI_ALLOC_MEM warning when requesting 0 (zero) bytes

2007-07-26 Thread Lisandro Dalcin

On 7/25/07, Jeff Squyres  wrote:

Be sure to read this thread in order -- the conclusion of the thread
was that we now actually *do* return NULL, per POSIX advice.


OK, I got confused. And now, MPI_Free_mem is going to fail with a NULL
pointer? Not sure what POSIX says, but then OMPI should also follow its
advice, right?

--
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594



Re: [OMPI devel] MPI_ALLOC_MEM warning when requesting 0 (zero) bytes

2007-07-26 Thread Jeff Squyres

On Jul 26, 2007, at 12:42 PM, Lisandro Dalcin wrote:


Be sure to read this thread in order -- the conclusion of the thread
was that we now actually *do* return NULL, per POSIX advice.


OK, I got confused. And now, MPI_Free_mem is going to fail with a NULL
pointer? Not sure what POSIX says, but then OMPI should also follow its
advice, right?


It's not going to *fail* -- it's just going to return a NULL pointer  
if you ask for 0 bytes.  This is perfectly valid according to POSIX's  
definition of free().  Also, passing NULL to MPI_FREE_MEM will now  
silently succeed (it will currently raise an MPI_ERR_ARG exception).
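
A small usage sketch of the behavior described above (reflecting the change
discussed in this thread; older releases raised MPI_ERR_ARG on the NULL free):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    void *buf = NULL;

    MPI_Init(&argc, &argv);

    MPI_Alloc_mem(0, MPI_INFO_NULL, &buf);    /* 0-byte request may yield NULL */
    if (NULL == buf) {
        printf("got NULL back, as POSIX allows for a zero-size allocation\n");
    }
    MPI_Free_mem(buf);                        /* freeing NULL now silently succeeds */

    MPI_Finalize();
    return 0;
}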


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [RFC] Sparse group implementation

2007-07-26 Thread Brian Barrett

Mohamad -

A couple of comments / questions:

1) Why do you need the #if OMPI_GROUP_SPARSE in communicator/comm.c?
   That seems like part of the API that should under no conditions change
   based on sparse/not sparse


2) I think it would be better to always have the flags and macros
available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even
when sparse groups are disabled.  They don't take up any space, and
mean fewer #ifs in the general code base


3) Instead of the GROUP_GET_PROC_POINTER macro, why not just change
the behavior of the ompi_group_peer_lookup() function, so that there
is symmetry with how you get a proc from a communicator?  Static
inline functions (especially short ones like that) are basically
free.  We'll still have to fix all the places in the code where the
macro is used or people poke directly at the group structure, but I
like static inline over macros whenever possible.  So much easier to
debug.


Other than that, I think you've got my concerns pretty much addressed.

Brian

On Jul 25, 2007, at 8:45 PM, Mohamad Chaarawi wrote:


In the current code, almost all #ifs are due to the fact that we don't
want to add the extra memory used by the sparse parameters that are
added to the group structure.
The additional parameters are 5 pointers and 3 integers.
If nobody objects, I would actually keep those extra parameters, even if
sparse groups are disabled (the default case at configure time), because it
would reduce the number of #ifs in the code to only 2 (as I recall I
had it before).

Thank you,

Mohamad

On Wed, July 25, 2007 4:23 pm, Brian Barrett wrote:

On Jul 25, 2007, at 3:14 PM, Jeff Squyres wrote:


On Jul 25, 2007, at 5:07 PM, Brian Barrett wrote:


It just adds a lot of #if's throughout the code.  Other than that,
there's no reason to remove it.


I agree, lots of #ifs are bad.  But I guess I don't see the problem.

The only real important thing people were directly accessing in the
ompi_group_t is the array of proc pointers.  Indexing into them could
be done with a static inline function that just has slightly
different time complexity based on compile options.  Static inline
function that just does an index in the group proc pointer would have
almost no added overhead (none if the compiler doesn't suck).


Ya, that's what I proposed.  :-)

But I did also propose removing the extra #if's so that the sparse
group code would be available and we'd add an extra "if" in the
critical code path.

But we can do it this way instead:

Still use the MACRO to access proc_t's.  In the --disable-sparse-groups
scenario, have it map to comm.group.proc[i].  In the --enable-sparse-groups
scenario, have it like I listed in the original proposal:

 static inline ompi_proc_t *lookup_group(ompi_group_t *group, int index) {
     if (group_is_dense(group)) {
         return group->procs[index];
     } else {
         return sparse_group_lookup(group, index);
     }
 }

With:

a) groups are always dense if --enable and an MCA parameter turns off
sparse groups, or
b) there's an added check in the inline function for whether the MCA
parameter is on

I'm personally in favor of a) because it means only one conditional
in the critical path.


I don't really care about the sparse groups turned on case.  I just
want minimal #ifs in the global code and to not have an if() { ... }
in the critical path when sparse groups are disabled :).

Brian




--
Mohamad Chaarawi
Instructional Assistant   http://www.cs.uh.edu/~mschaara
Department of Computer ScienceUniversity of Houston
4800 Calhoun, PGH Room 526Houston, TX 77204, USA





Re: [OMPI devel] [RFC] Sparse group implementation

2007-07-26 Thread Mohamad Chaarawi

On Thu, July 26, 2007 12:20 pm, Brian Barrett wrote:
> Mohamad -
>
> A couple of comments / questions:
>
> 1) Why do you need the #if OMPI_GROUP_SPARSE in communicator/comm.c?
> That seems like
> part of the API that should under no conditions change based on
> sparse/not sparse
>
I don't; there was one #if that I just removed.
But we do need to check in some cases, like in ompi_comm_get_rprocs, that we
are not using direct access to the pointer list. For example:

if (OMPI_GROUP_IS_DENSE(local_comm->c_local_group)) {
    rc = ompi_proc_pack(local_comm->c_local_group->grp_proc_pointers,
                        local_size, sbuf);
}
/* get the proc list for the sparse implementations */
else {
    proc_list = (ompi_proc_t **) calloc
        (local_comm->c_local_group->grp_proc_count,
         sizeof (ompi_proc_t *));
    for (i = 0; i < local_comm->c_local_group->grp_proc_count; i++)
        proc_list[i] =
            GROUP_GET_PROC_POINTER(local_comm->c_local_group, i);
    rc = ompi_proc_pack(proc_list, local_size, sbuf);
}

Here, if sparse groups are disabled, we don't really want to allocate and
fill a list of pointers that already exists (so as not to waste more memory
and time).

> 2) I think it would be better to always have the flags and macros
> available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even
> when sparse groups are disabled.  They dont' take up any space, and
> mean less #ifs in the general code base
>
That's what I was actually proposing: keep the flags (there are no
macros, just GROUP_GET_PROC_POINTER) and the sparse parameters in the
group structure, and this will mean only 1, maybe 2, #ifs.

> 3) Instead of the GROU_GET_PROC_POINTER macro, why not just change
> the behavior of the ompi_group_peer_lookup() function, so that there
> is symmetry with how you get a proc from a communicator?  static
> inline functions (especially short ones like that) are basically
> free.  We'll still have to fix all the places in the code where the
> macro is used or people poke directly at the group structure, but I
> like static inline over macros whenever possible.  So much easier t
> debug.

Actually, I never knew until this morning that this function was in the
code. I have an inline function, ompi_group_lookup (which does the same
thing), that checks whether the group is dense or not and acts
accordingly. But using the inline function instead of the macro again
means that we need to compile in all the sparse parameters/code, which
I'm for.

>
> Other than that, I think you've got my concerns pretty much addressed.
>
> Brian
>
> On Jul 25, 2007, at 8:45 PM, Mohamad Chaarawi wrote:
>
>> In the current code, almost all #ifs are due to the fact that we don't
>> want to add the extra memory by the sparse parameters that are
>> added to
>> the group structure.
>> The additional parameters are 5 pointers and 3 integers.
>> If nobody objects, i would actually keep those extra parameters,
>> even if
>> sparse groups are disabled (in the default case on configure),
>> because it
>> would reduce the number of #ifs in the code to only 2 (as i recall
>> that i
>> had it before) ..
>>
>> Thank you,
>>
>> Mohamad
>>
>> On Wed, July 25, 2007 4:23 pm, Brian Barrett wrote:
>>> On Jul 25, 2007, at 3:14 PM, Jeff Squyres wrote:
>>>
 On Jul 25, 2007, at 5:07 PM, Brian Barrett wrote:

>> It just adds a lot of #if's throughout the code.  Other than that,
>> there's no reason to remove it.
>
> I agree, lots of #ifs are bad.  But I guess I don't see the
> problem.
> The only real important thing people were directly accessing in the
> ompi_group_t is the array of proc pointers.  Indexing into them
> could
> be done with a static inline function that just has slightly
> different time complexity based on compile options.  Static inline
> function that just does an index in the group proc pointer would
> have
> almost no added overhead (none if the compiler doesn't suck).

 Ya, that's what I proposed.  :-)

 But I did also propose removing the extra #if's so that the sparse
 group code would be available and we'd add an extra "if" in the
 critical code path.

 But we can do it this way instead:

 Still use the MACRO to access proc_t's.  In the --disable-sparse-
 groups scenario, have it map to comm.group.proc[i].  In the --
 enable-
 sparse-groups scenario, have it like I listed in the original
 proposal:

  static inline ompi_proc_t lookup_group(ompi_group_t *group, int
 index) {
  if (group_is_dense(group)) {
  return group->procs[index];
  } else {
  return sparse_group_lookup(group, index);
  }
  }

 With:

 a) groups are always dense if --enable and an MCA parameter turns
 off
 spa

Re: [OMPI devel] [RFC] Sparse group implementation

2007-07-26 Thread Brian Barrett

On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote:


2) I think it would be better to always have the flags and macros
available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even
when sparse groups are disabled.  They dont' take up any space, and
mean less #ifs in the general code base


That's what i actually was proposing.. keep the flags (there are no
macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters  
in the

group strucutre, and this will mean, only 1 maybe 2 #ifs..


Why would this mean having the sparse parameters in the group structure?


3) Instead of the GROU_GET_PROC_POINTER macro, why not just change
the behavior of the ompi_group_peer_lookup() function, so that there
is symmetry with how you get a proc from a communicator?  static
inline functions (especially short ones like that) are basically
free.  We'll still have to fix all the places in the code where the
macro is used or people poke directly at the group structure, but I
like static inline over macros whenever possible.  So much easier t
debug.


Actually i never knew till this morning that this function was in the
code.. I have an inline function ompi_group_lookup (which does the  
same

thing), that actually checks if the group is dense or not and act
accordingly.. but to use the inline function instead of the macro,  
means
again that we need to compile in all the sparse parameters/code,  
which im

for..


No, it doesn't.  Just have something like:

static inline ompi_proc_t*
ompi_group_lookup(ompi_group_t *group, int peer)
{
#if OMPI_GROUP_SPARSE
    /* big long lookup function for sparse groups here */
#else
    return group->grp_proc_pointers[peer];
#endif
}

Brian


Re: [OMPI devel] Hostfiles - yet again

2007-07-26 Thread Ralph H Castain
Hi Aurelien

Perhaps some bad news on this subject - see below.


On 7/26/07 7:53 AM, "Ralph H Castain"  wrote:

> 
> 
> 
> On 7/26/07 7:33 AM, "rolf.vandeva...@sun.com" 
> wrote:
> 
>> Aurelien Bouteiller wrote:


>>> Currently I proceed to two different mpirun with a single orte
>>> seed holding the registry. This way I get two different hostfiles, one
>>> for computing nodes, one for FT services. I just want to make sure
>>> everybody understood this requirement so that this feature does not
>>> disappear in the brainstorming :]
>>>  
>>> 

After some investigation, I'm afraid that I have to report that this - as
far as I understand what you are doing - may no longer work in Open MPI in
the future (and I'm pretty sure isn't working in the trunk today except
[maybe] in the special case of hostfile - haven't verified that).

To ensure we are correctly communicating, let me reiterate what I understand
you are doing:

1. In one window, you start a persistent daemon. You then enter "mpirun" at
the command line, specifying a hostfile (let's call it "foo" for now) and
the universe used to start the persistent daemon. Thus, mpirun connects to
that universe and runs within it.

2. In another window, you type "mpirun" at the command line, specifying a
different hostfile ("bar") and again giving it the universe used to start
the persistent daemon. Thus, both mpiruns are being "managed" by the same
HNP (the persistent daemon).

First, there are major issues here involving confusion over allocations and
synchronization between the lifetimes of the two jobs started in this
manner. You may not see those in hostfile-only use cases, but for managed
environments, this proved to cause undesirable confusion over process
placement and unexpected application failures. Accordingly, we have been
working to eliminate this usage (although the trunk will currently still
allow it in some cases).

This was caused by mpirun itself processing its local environment and then
"pushing" it into the global registry. Keeping everything separated causes a
bookkeeper's headache and many lines of code that we would like to
eliminate.

The current future design only processes allocations at the HNP itself.
Thus, the persistent daemon would only be capable of sensing its own local
allocation - it cannot see an allocation obtained in a separate
window/login. This unfortunately extends to hostfiles as well - the
persistent daemon can process the hostfile provided on its command line or
environment, but has no mechanism for reading another one.

The exception to this is comm_spawn. Our current intent was to allow
comm_spawn to specify a hostfile that could be read by the HNP and used for
the child job. However, we are still discussing whether this hostfile should
be allowed to "add" nodes to the known available resources, or only specify
a subset of the already-known resource pool. I suspect we will opt for the
latter interpretation as we otherwise open an entirely different set of
complications.

So I am not sure that you will be able to continue working this way. You may
have to start your regular application with the larger pool of resources,
specify the ones you want used for the application itself via -host, and
then "comm_spawn" your FT services on the other nodes using -host in that
launch. Alternatively, you could use the multiple app_context capability to
start it all from the command line:

mpirun -hostfile big_pool -n 10 -host 1,2,3,4 application : -n 2 -host
99,100 ft_server

Hope that helps explain things. As I hope I have indicated, I -think- you
will still be able to do what you described, but probably not the way you
have been doing it.

Please feel free to comment. If this is a big enough issue to a large enough
audience, then we can try to find a way to solve it (assuming Open MPI's
community decides to support it).

Ralph


> 
>>> Next requirement is the ability to add during runtime some nodes to the
>>> initial pool. Because node may fail (but it is the same with comm_spawn
>>> basically) , I might need some (lot of) spare nodes to replace failed
>>> ones. As I do not want to request for twice as many nodes as I need
>>> (after all, things could just go fine, why should I waste that many
>>> computing resources for idle spares ?), I definitely want to be able to
>>> allocate some new nodes to the pool of the already running machines. As
>>> far as I understand, this is impossible to achieve with the usecase2 and
>>> quite difficult in usecase1. In my opinion, having the ability to spawn
>>> on nodes which are not part of the initial hostfile is a key feature
>>> (and not only for FT purposes).
>>> 
>>>  
>>> 
>> I am looking for more detail into the above issue.   What
>> resource manager are you using?
>> 
>> Ideally, we would prefer not to support this.  Any nodes
>> that you run on, or hope to run on, would be designated
>> at the start.   For example:
>> 
>> mpirun -np 1 --host a,b,c,d,e,f,g
>> 
>> This would cause the on

Re: [OMPI devel] [RFC] Sparse group implementation

2007-07-26 Thread Mohamad Chaarawi

On Thu, July 26, 2007 1:18 pm, Brian Barrett wrote:
> On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote:
>
>>> 2) I think it would be better to always have the flags and macros
>>> available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even
>>> when sparse groups are disabled.  They dont' take up any space, and
>>> mean less #ifs in the general code base
>>>
>> That's what i actually was proposing.. keep the flags (there are no
>> macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters
>> in the
>> group strucutre, and this will mean, only 1 maybe 2 #ifs..
>
> Why would this mean having the sparse parameters in the group structure?

Not sure if I understood your question right, but in the group struct we
added 5 pointers and 3 integers. If we want to compile these out, we would
then need #ifs all around the code where we use these parameters.

>
>>> 3) Instead of the GROU_GET_PROC_POINTER macro, why not just change
>>> the behavior of the ompi_group_peer_lookup() function, so that there
>>> is symmetry with how you get a proc from a communicator?  static
>>> inline functions (especially short ones like that) are basically
>>> free.  We'll still have to fix all the places in the code where the
>>> macro is used or people poke directly at the group structure, but I
>>> like static inline over macros whenever possible.  So much easier t
>>> debug.
>>
>> Actually i never knew till this morning that this function was in the
>> code.. I have an inline function ompi_group_lookup (which does the
>> same
>> thing), that actually checks if the group is dense or not and act
>> accordingly.. but to use the inline function instead of the macro,
>> means
>> again that we need to compile in all the sparse parameters/code,
>> which im
>> for..
>
> No, it doesn't.  Just have something like:
>
> static inline ompi_proc_t*
> ompi_group_lookup(ompi_group_t *group, int peer)
> {
> #if OMPI_GROUP_SPARSE
>  /* big long lookup function for sparse groups here */
> #else
>  return group->grp_proc_pointers[peer]
> #endif
> }
>
ok, i guess i can do that...

> Brian


-- 
Mohamad Chaarawi
Instructional Assistant   http://www.cs.uh.edu/~mschaara
Department of Computer ScienceUniversity of Houston
4800 Calhoun, PGH Room 526Houston, TX 77204, USA



Re: [OMPI devel] [RFC] Sparse group implementation

2007-07-26 Thread Brian Barrett

On Jul 26, 2007, at 1:01 PM, Mohamad Chaarawi wrote:



On Thu, July 26, 2007 1:18 pm, Brian Barrett wrote:

On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote:


2) I think it would be better to always have the flags and macros
available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC)  
even

when sparse groups are disabled.  They dont' take up any space, and
mean less #ifs in the general code base


That's what i actually was proposing.. keep the flags (there are no
macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters
in the
group strucutre, and this will mean, only 1 maybe 2 #ifs..


Why would this mean having the sparse parameters in the group  
structure?


not sure if i understood your question right, but in the group  
struct we
added 5 integers and 3 pointer.. if we want to compile these out,  
we would

then need all the #ifs around the code where we use these parameters..


I don't follow why you would need all the sparse stuff in
ompi_group_t when OMPI_GROUP_SPARSE is 0.  The OMPI_GROUP_IS and
OMPI_GROUP_SET macros only modify grp_flags, which is always present.


Like ompi_group_peer_lookup, much can be hidden inside the
functions rather than exposed through the interface, if you're
concerned about the other functionality currently #if'ed in the code.


Brian


Re: [OMPI devel] [RFC] Sparse group implementation

2007-07-26 Thread Mohamad Chaarawi

On Thu, July 26, 2007 2:07 pm, Brian Barrett wrote:
> On Jul 26, 2007, at 1:01 PM, Mohamad Chaarawi wrote:
>
>>
>> On Thu, July 26, 2007 1:18 pm, Brian Barrett wrote:
>>> On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote:
>>>
> 2) I think it would be better to always have the flags and macros
> available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC)
> even
> when sparse groups are disabled.  They dont' take up any space, and
> mean less #ifs in the general code base
>
 That's what i actually was proposing.. keep the flags (there are no
 macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters
 in the
 group strucutre, and this will mean, only 1 maybe 2 #ifs..
>>>
>>> Why would this mean having the sparse parameters in the group
>>> structure?
>>
>> not sure if i understood your question right, but in the group
>> struct we
>> added 5 integers and 3 pointer.. if we want to compile these out,
>> we would
>> then need all the #ifs around the code where we use these parameters..
>
> I don't follow why you would need all the sparse stuff in
> ompi_group_t when OMPI_GROUP_SPARSE is 0.  The OMPI_GROUP_IS and
> OMPI_GROU_SET macros only modify grp_flags, which is always present.
>
I don't need them; right now they are compiled out. But since they are,
all functions using these parameters (for example translate_ranks_strided,
the long lookup function) also have to be compiled out, and other common
functions that changed (like translate_ranks, where we now check whether
sparse groups are enabled so we can use the easier translate_ranks
corresponding to the storage type) have to have #ifs to compile stuff out.
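
To make the storage-type idea concrete, here is a hypothetical sketch of why
a strided group needs no proc-pointer array at all; the struct and field
names are illustrative, not the actual ompi_group_t layout or the real
sparse-groups implementation.

struct ompi_group_t;                    /* forward declaration of the group type */

/* A strided group holds "every k-th rank" of a parent group, so three small
 * fields replace a full proc-pointer array. */
struct strided_group {
    struct ompi_group_t *grp_parent;    /* group we were derived from */
    int grp_offset;                     /* parent rank of our rank 0 */
    int grp_stride;                     /* distance between successive ranks */
};

/* Translate a rank in the strided group to the corresponding parent rank. */
static inline int strided_to_parent_rank(const struct strided_group *g, int rank)
{
    return g->grp_offset + rank * g->grp_stride;
}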

> Like the ompi_group_peer_lookup, much can be hidden inside the
> functions rather than exposed through the interface, if you're
> concerned about the other functionality currently #if'ed in the code.
>
In the ompi_group_peer_lookup that you suggested, we have an #if, so it's the
same with all functions that use sparse parameters.

I think I get what you are saying: I don't want any functionality from
including the sparse stuff when it is disabled, just easier code to
look at.


> Brian


-- 
Mohamad Chaarawi
Instructional Assistant   http://www.cs.uh.edu/~mschaara
Department of Computer ScienceUniversity of Houston
4800 Calhoun, PGH Room 526Houston, TX 77204, USA



Re: [OMPI devel] Hostfiles - yet again

2007-07-26 Thread Aurelien Bouteiller

Ralph H Castain wrote:

After some investigation, I'm afraid that I have to report that this - as
far as I understand what you are doing - may no longer work in Open MPI in
the future (and I'm pretty sure isn't working in the trunk today except
[maybe] in the special case of hostfile - haven't verified that).

To ensure we are correctly communicating, let me reiterate what I understand
you are doing:
  
Correct. Also consider that for my testing I use a batch scheduler that
is not managed by ORTE right now, and I provide the hostfiles myself (this
batch scheduler is named OAR and is in use on the grid5000 research
facility in France).



This was caused by mpirun itself processing its local environment and then
"pushing" it into the global registry. Keeping everything separated causes a
bookkeeper's headache and many lines of code that we would like to
eliminate.

  
I see the point. I agree there is very little benefit in allowing users to
have different local environments on different mpirun instances, while
it would be a real pain to keep clean code managing this. For my own
usage, the app_context feature you described is a more elegant and
equivalent way of spawning my FT services. I will switch to this right
away.


Still, it might be of some use to be able to start different mpiruns the
same way you plan for comm_spawn to work: sharing the same environment, but
allowing a different hostfile. The use case that comes to mind is
"grid", where different batch schedulers are in use on each
cluster, so you can't gather a single hostfile. This is not a feature I
would fight for, but I can imagine some people might find it useful.


More important for me is the ability to refill the hostfile with fresh
hosts when some of the original ones have died. Preventively allocating a
huge number of spares is just not the correct way to go. That said, I am
not sure that even the best comm_spawn you discussed could be of much
help in this case, as I do not want the new nodes to go into a different
COMM_WORLD. Finding a way to update the registry and all the orteds to do
so is a much larger issue than simple spawning, and I have not been
really thinking about it for now. Maybe we should discuss this issue
separately.


Aurelien

Please feel free to comment. If this is a big enough issue to a large enough
audience, then we can try to find a way to solve it (assuming Open MPI's
community decides to support it).

Ralph


  

Next requirement is the ability to add during runtime some nodes to the
initial pool. Because node may fail (but it is the same with comm_spawn
basically) , I might need some (lot of) spare nodes to replace failed
ones. As I do not want to request for twice as many nodes as I need
(after all, things could just go fine, why should I waste that many
computing resources for idle spares ?), I definitely want to be able to
allocate some new nodes to the pool of the already running machines. As
far as I understand, this is impossible to achieve with the usecase2 and
quite difficult in usecase1. In my opinion, having the ability to spawn
on nodes which are not part of the initial hostfile is a key feature
(and not only for FT purposes).

 



I am looking for more detail into the above issue.   What
resource manager are you using?

Ideally, we would prefer not to support this.  Any nodes
that you run on, or hope to run on, would be designated
at the start.   For example:

mpirun -np 1 --host a,b,c,d,e,f,g

This would cause the one process of the mpi job to start on host a.
Then, the mpi job has available to it the other hosts should it decide
later to start a job on them.  However no ORTE daemons would
be started on those nodes until calls to MPI_Comm_spawn
occur.   So, the MPI job would not be consuming any resources
until called upon to.
  

This has actually been the subject of multiple threads on the user list and
is considered a critical capability by some users and vendors. I believe
there is little problem in allowing those systems that can support it to
dynamically add nodes to ORTE via some API into the resource manager. At the
moment, none of the RMs support it, but LSF will (and TM at least may)
shortly do so, and some of their customers are depending upon it.

The problem is that job startup could be delayed for significant time if all
hosts must be preallocated. Admittedly, this raises all kinds of issues
about how long the job could be stalled waiting for the new hosts. However,
as the other somewhat exhaustive threads have discussed, there are computing
models that can live with this uncertainty, and RMs that will provide async
callbacks to allow the rest of the app to continue working while waiting.

Just my $0.2 - again, this goes back to...are there use-cases and
customers to which Open MPI is simply going to say "we won't support that"?



Rolf

  

I know there have been some extra discussions on this subject.
Unfortunately it looks like I am not

Re: [OMPI devel] Hostfiles - yet again

2007-07-26 Thread Ralph H Castain



On 7/26/07 2:24 PM, "Aurelien Bouteiller"  wrote:

> Ralph H Castain wrote:
>> After some investigation, I'm afraid that I have to report that this - as
>> far as I understand what you are doing - may no longer work in Open MPI in
>> the future (and I'm pretty sure isn't working in the trunk today except
>> [maybe] in the special case of hostfile - haven't verified that).
>> 
>> To ensure we are correctly communicating, let me reiterate what I understand
>> you are doing:
>>   
> Correct. Also consider that for my testing I use a batch scheduler that
> is not managed by orte right now and provide myself the hostfiles (This
> batch scheduler is named OAR and is in use on the grid5000 research
> facility in France).
> 
>> This was caused by mpirun itself processing its local environment and then
>> "pushing" it into the global registry. Keeping everything separated causes a
>> bookkeeper's headache and many lines of code that we would like to
>> eliminate.
>> 
>>   
> I see the point. I Agree there is very few benefit at allowing users to
> have different local environments on different mpirun instances; while
> it should be a real pain to have a clean code managing this. For my sole
> usage, the app_context feature you described is a more elegant and
> equivalent way of spawning my FT services. I will switch to this right
> away.
> 
> Still it might be of some use to be able to start different mpirun the
> same way you plan comm_spawn to work: sharing the same environment, but
> allowing for use of a different hostfile. The use case that comes in
> mind is "grid", where different batch schedulers are in use on each
> clusters, so you can't gather a single hostfile. This is not a feature I
> would fight for, but I can imagine some people might find it useful.


One of the design changes we made was to explicitly not support
multi-cluster operations from inside of Open MPI. Instead, people (not us)
are looking at adding a layer on top of Open MPI to handle the cross-cluster
coordination. I expect you'll hear more about those efforts in the
not-too-distant future.


> 
> More important for me is the ability to refill the hostfile with fresh
> hosts when some of the original ones died. Allocating an huge amount of
> spares preventively is just not the correct way to go. On the side I am
> not sure  that even the best comm_spawn you discussed could be of some
> help in this case as I do not want the new nodes to go in a different
> COMM_WORLD. Finding a way to update the registry and all the orted to do
> so is a much larger issue than simple spawning and I have not been
> really thinking about it for now. Maybe we should discuss this issue
> separately.

Ah, now -that- is a different topic indeed. I do plan to support a dynamic
add_hosts API as part of the revamped system. I'll try to flesh that out as
a separate RFC later.

Thanks
Ralph

> 
> Aurelien
>> Please feel free to comment. If this is a big enough issue to a large enough
>> audience, then we can try to find a way to solve it (assuming Open MPI's
>> community decides to support it).
>> 
>> Ralph
>> 
>> 
>>   
> Next requirement is the ability to add during runtime some nodes to the
> initial pool. Because node may fail (but it is the same with comm_spawn
> basically) , I might need some (lot of) spare nodes to replace failed
> ones. As I do not want to request for twice as many nodes as I need
> (after all, things could just go fine, why should I waste that many
> computing resources for idle spares ?), I definitely want to be able to
> allocate some new nodes to the pool of the already running machines. As
> far as I understand, this is impossible to achieve with the usecase2 and
> quite difficult in usecase1. In my opinion, having the ability to spawn
> on nodes which are not part of the initial hostfile is a key feature
> (and not only for FT purposes).
> 
>  
> 
> 
 I am looking for more detail into the above issue.   What
 resource manager are you using?
 
 Ideally, we would prefer not to support this.  Any nodes
 that you run on, or hope to run on, would be designated
 at the start.   For example:
 
 mpirun -np 1 --host a,b,c,d,e,f,g
 
 This would cause the one process of the mpi job to start on host a.
 Then, the mpi job has available to it the other hosts should it decide
 later to start a job on them.  However no ORTE daemons would
 be started on those nodes until calls to MPI_Comm_spawn
 occur.   So, the MPI job would not be consuming any resources
 until called upon to.
   
>>> This has actually been the subject of multiple threads on the user list and
>>> is considered a critical capability by some users and vendors. I believe
>>> there is little problem in allowing those systems that can support it to
>>> dynamically add nodes to ORTE via some API into the resource manager. At the
>>> moment, none of th

Re: [OMPI devel] Hostfiles - yet again

2007-07-26 Thread Aurelien Bouteiller

mpirun -hostfile big_pool -n 10 -host 1,2,3,4 application : -n 2 -host
99,100 ft_server


This will not work: it is a way to launch MIMD jobs that share the
same COMM_WORLD, not a way to launch two different applications that
interact through Accept/Connect.


Direct consequences on simple NAS benchmarks are:
* if the second command does not call MPI_Init, then the first
application locks forever in MPI_Init

* if both call MPI_Init, the MPI_Comm_size of the jobs is incorrect.



bouteill@dancer:~$ ompi-build/debug/bin/mpirun -prefix 
/home/bouteill/ompi-build/debug/ -np 4 -host node01,node02,node03,node04 
NPB3.2-MPI/bin/lu.A.4 : -np 1 -host node01 NPB3.2-MPI/bin/mg.A.1



NAS Parallel Benchmarks 3.2 -- LU Benchmark

Warning: program is running on  5 processors
but was compiled for   4
Size:  64x 64x 64
Iterations: 250
Number of processes: 5