[OMPI devel] [Fwd: [devel-core] Open MPI concall agenda (12/4/2007)]

2007-12-04 Thread Terry Dontje
This is a resend since it looks like some people didn't get the agenda 
for some reason.


--td
--- Begin Message ---
Please let me know if you have any other agenda topics for this week's telecon.

Date: December 4, 2007
Time: 11am Eastern/New York
Dial-in number: 866-629-8606

Agenda:
- 1.2.x update (Rich/Terry)
- 1.3 update (Brad/George)
- updates: LANL, Houston, HLRS, IBM, UTK
- other items?

-

December 4: LANL, Houston, HLRS, IBM, UTK
December 11: Mellanox, Myricom, IU, QLogic, Sun


--- End Message ---


[OMPI devel] RTE issues: scalability & complexity

2007-12-04 Thread Ralph H Castain
Yo all

As (I hope) many of you know, we are in a final phase of revamping ORTE to
simplify the code, enhance scalability, and improve reliability. In working
on this effort, we recently uncovered four issues that merit broader
discussion (apologies in advance for verbosity). Although these issues are
somewhat related, I realize that people may care about some and not others.
Hence, to give everyone the chance to comment only on those you -do- care
about, and to at least somewhat constrain the length of the emails, I will be
sending out a series of four emails in this area.

The issues will include:

I. Support for non-MPI jobs

II. Interaction between the ROUTED and GRPCOMM frameworks

III. Collective communications across daemons

IV. RTE/MPI relative modex responsibilities


Please feel free to contact me and/or comment on any of these issues. As a
reminder, if you do comment back to the Devel mailing list, please use
"reply all" so I will also receive a copy of the message.

Thanks
Ralph




[OMPI devel] RTE issue I. Support for non-MPI jobs

2007-12-04 Thread Ralph H Castain
I. Support for non-MPI jobs
Considerable complexity currently exists in ORTE because of the stipulation
in our first requirements document that users be able to mpirun non-MPI jobs
- i.e., that we support such calls as "mpirun -n 100 hostname". This creates
a situation, however, where the RTE cannot know whether the application will
call MPI_Init (or at least orte_init), which has significant implications for
the RTE's architecture. For example, during the launch of the application's
processes, the RTE cannot go into any form of blocking receive while waiting
for the procs to report a successful startup, as that report will never arrive
when executing something like "hostname".

Jeff has noted that support for non-MPI jobs is not something most (all?)
MPIs currently provide, nor something that users are likely to exploit as
they can more easily just "qsub hostname" (or the equivalent for that
environment). While nice for debugging purposes, therefore, it isn't clear
that supporting non-MPI jobs is worth the increased code complexity and
fragility.

In addition, the fact that we do not know if a job will call Init limits our
ability to do collective communications within the RTE, and hence our
scalability - see the separate note on that subject for further discussion.

This would be a "regression" in behavior, though, so the questions for the
community are:

(a) do we want to retain the feature to run non-MPI jobs with mpirun as-is
(and accept the tradeoffs, including the one described below in II)?

(b) do we provide a flag to mpirun (perhaps adding the distinction that
"orterun" must be used for non-MPI jobs?) to indicate "this is NOT an MPI
job" so we can act accordingly?

(c) simply eliminate support for non-MPI jobs?

(d) other suggestions?

Ralph




[OMPI devel] RTE Issue II: Interaction between the ROUTED and GRPCOMM frameworks

2007-12-04 Thread Ralph H Castain
II. Interaction between the ROUTED and GRPCOMM frameworks
When we initially developed these two frameworks within the RTE, we
envisioned them to operate totally independently of each other. Thus, the
grpcomm collectives provide algorithms such as a binomial "xcast" that uses
the daemons to scalably send messages across the system.

However, we recently realized that the efficacy of the current grpcomm
algorithms hinges directly on the daemons being fully connected - which, we
were recently told, may not be the case as other people introduce different
ROUTED components. For example, using the binomial algorithm in grpcomm's
xcast while a ring topology is selected in ROUTED would likely result in
terrible performance.
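
To make the mismatch concrete, here is a minimal, self-contained sketch in C
(hypothetical code, not the actual grpcomm component) of the binomial fan-out
pattern that xcast uses. Every relay step is a "direct" send to a rank
computed from the bit pattern of the daemon's own rank - cheap only if the
daemons are fully connected. If ROUTED has a ring topology selected, each of
those sends silently becomes an O(N)-hop relay, which is where the terrible
performance would come from.

    /* Simplified binomial-tree fan-out, rooted at daemon 0 (illustrative
     * only).  Each daemon relays directly to peers whose ranks it computes
     * locally -- implicitly assuming it can reach any other daemon. */
    #include <stdio.h>

    static void binomial_relay(int me, int nprocs)
    {
        int mask = 1;

        /* receive phase: the lowest set bit of 'me' identifies our parent */
        while (mask < nprocs) {
            if (me & mask) {
                printf("daemon %d receives from daemon %d\n", me, me ^ mask);
                break;
            }
            mask <<= 1;
        }

        /* send phase: relay to children at increasingly "close" ranks;
         * note these are arbitrary ranks, not topology neighbors */
        mask >>= 1;
        while (mask > 0) {
            int child = me ^ mask;
            if (child < nprocs) {
                printf("daemon %d sends directly to daemon %d\n", me, child);
            }
            mask >>= 1;
        }
    }

    int main(void)
    {
        const int nprocs = 8;
        for (int me = 0; me < nprocs; me++) {
            binomial_relay(me, nprocs);
        }
        return 0;
    }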

This raises the following questions:

(a) should the GRPCOMM and ROUTED frameworks be consolidated to ensure that
the group collectives algorithms properly "match" the communication
topology?

(b) should we automatically select the grpcomm/routed pairings based on some
internal logic?

(c) should we leave this "as-is", with the user responsible for making
intelligent choices (and for detecting when performance is bad due to this
mismatch)?

(d) other suggestions?

Ralph




[OMPI devel] RTE Issue III: Collective communications across daemons

2007-12-04 Thread Ralph H Castain
III. Collective communications across daemons
A few months ago, we deliberately extended the abstraction between the RTE
and MPI layers to reduce their interaction. This has generally been
perceived as a good thing, but it did have a cost: namely, it increased the
communications required during launch. In prior OMPI versions, we took
advantage of tighter integration to aggregate RTE and MPI communications
required during startup - this was lost in the abstraction effort.

We have since been working to reduce the resulting "abstraction penalty". We
have managed to achieve communication performance that scales linearly with
the number of nodes. Further improvements, though, depend upon our ability
to do quasi-collective communications in the RTE.

Collectives in the RTE are complicated by the current requirement to support
non-MPI applications (topic of another email), and by the fact that not
every node participates in a given operation. The former issue is reflected
in the fact that the RTE (and hence, the daemon) cannot know if the
application process is going to call Init or not - hence, the logic in the
daemon must not block on any communication during launch since the proc may
completely execute and terminate without ever calling Init. Thus, entering a
collective to, for example, collect RML contact info is problematic as that
info may never become available - and so, the HNP -cannot- enter a
collective call to wait for its arrival.

The latter issue exists even for MPI jobs. Consider the case of a single
process job that comm_spawns a child job onto other nodes. The RTE will
launch daemons on the new nodes, and then broadcast the "launch procs"
command across all the daemons (this is done to exploit a scalable comm
procedure). Thus, the daemon on the initial node will see the launch
command, but will know it is not participating and hence take no action.

If we now attempt to perform a collective communication (say, to collect RML
contact info), we face four interacting obstacles:

(a) the initial daemon isn't launching anything this time, and so won't know
it has to participate. This can obviously be resolved since it will
certainly know that a launch is being conducted, so we could have it simply
go ahead and call the appropriate collective at that time;

(b) the launch of the local procs is conducted asynchronously - hence, no
daemon can know when another daemon has completed the launch and thus is
ready to communicate;

(c) the failure of any local launch can generate an error response back to
the daemons with orders to kill their procs, exit, or other things. The
daemons must, therefore, not be in blocking communication calls as this will
prevent them from responding as directed; and

(d) the daemons may not be fully connected - hence, any collective must
"follow" the communication topology.

What we could use is a quasi-collective "gather" based on non-blocking
receives that preserves the daemons' ability to respond to unexpected
commands such as "kill/exit". If someone is interested in working on this,
please contact me for a fuller description of the problem.
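
For anyone considering picking this up, here is a rough, self-contained sketch
in C of the intended behavior (all names here are invented for illustration -
this is not ORTE code, and the real version would sit on top of the RML's
non-blocking receives). The point is simply that the daemon tracks the gather
as state in an event-driven dispatcher instead of sitting in a blocking
receive, so a "kill/exit" command can still be honored mid-gather.

    #include <stdio.h>
    #include <stdbool.h>

    enum msg_type { MSG_CONTRIB, MSG_KILL, MSG_EXIT };

    struct gather_state {
        int  expected;   /* contributions we are waiting for      */
        int  received;   /* contributions seen so far             */
        bool abort;      /* set when a kill/exit command arrives  */
    };

    /* one dispatch per incoming message -- we never block on a specific
     * sender, we just handle whatever arrives next */
    static void handle_msg(struct gather_state *st, enum msg_type type, int sender)
    {
        switch (type) {
        case MSG_CONTRIB:
            st->received++;
            printf("contribution %d/%d from daemon %d\n",
                   st->received, st->expected, sender);
            if (st->received == st->expected) {
                printf("gather complete - relay result along the topology\n");
            }
            break;
        case MSG_KILL:
        case MSG_EXIT:
            /* because nothing blocked, this is honored immediately even
             * though the gather is incomplete */
            printf("command from daemon %d: kill local procs, abandon gather\n",
                   sender);
            st->abort = true;
            break;
        }
    }

    int main(void)
    {
        struct gather_state st = { .expected = 3, .received = 0, .abort = false };

        /* simulate an arbitrary arrival order, including a command that
         * interrupts the gather before all contributions arrive */
        handle_msg(&st, MSG_CONTRIB, 2);
        handle_msg(&st, MSG_CONTRIB, 5);
        handle_msg(&st, MSG_KILL, 0);
        if (!st.abort) {
            handle_msg(&st, MSG_CONTRIB, 7);
        }
        return 0;
    }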

Ralph




[OMPI devel] MPI_GROUP_EMPTY and MPI_Group_free()

2007-12-04 Thread Lisandro Dalcin
Dear all,

As I see some activity on a related ticket, below are some comments I
sent to Bill Gropp some days ago about this subject. Bill did not
write back; I know he is really busy.

Group operations are supposed to return new groups, so the user has to
free the result. Additionally, the standard says that those operations
may return the empty group. Hence the issue: if the empty group is
returned, should the user call MPI_Group_free() or not? I could not
find any part of the standard about freeing MPI_GROUP_EMPTY.

This issue is very similar to the one in MPI-1 related to error handlers.

I believe the standard should be a bit stricter here; a couple of
possibilities are:

* MPI_GROUP_EMPTY must be freed if it is the result of a group
operation. This is similar to the management of predefined error
handlers.

* MPI_GROUP_EMPTY cannot be freed, as it is a predefined handle. Users
then always have to check whether the result of a group operation is
MPI_GROUP_EMPTY to know whether they can free it. This is similar to
the current management of predefined datatypes.



-- 
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594



[OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-04 Thread Ralph H Castain
IV. RTE/MPI relative modex responsibilities
The modex operation conducted during MPI_Init currently involves the
exchange of two critical pieces of information:

1. the location (i.e., node) of each process in my job so I can determine
who shares a node with me. This is subsequently used by the shared memory
subsystem for initialization and message routing; and

2. BTL contact info for each process in my job.
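
As a concrete (and purely hypothetical - these are not actual OMPI structures
or APIs) sketch of what separating the two pieces would look like: if the RTE
supplies a node map, the shared memory subsystem can find its on-node peers
without the modex ever exchanging location data, leaving the modex with only
the BTL contact blobs.

    #include <stdio.h>
    #include <stdint.h>

    /* piece 1: would be supplied by the RTE (or, on environments without
     * our daemons, by some alternative mechanism) */
    typedef struct {
        uint32_t vpid;    /* process rank within the job    */
        uint32_t nodeid;  /* node the process is running on */
    } node_map_entry_t;

    static const node_map_entry_t node_map[] = {
        { 0, 0 }, { 1, 0 }, { 2, 1 }, { 3, 1 },
    };

    int main(void)
    {
        uint32_t me = 1;   /* pretend we are vpid 1 */
        uint32_t my_node = node_map[me].nodeid;

        /* all the shared memory subsystem really needs from piece 1 is
         * "who shares a node with me" */
        for (size_t i = 0; i < sizeof(node_map) / sizeof(node_map[0]); i++) {
            if (node_map[i].nodeid == my_node && node_map[i].vpid != me) {
                printf("vpid %u shares my node\n", (unsigned) node_map[i].vpid);
            }
        }
        return 0;
    }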

During our recent efforts to further abstract the RTE from the MPI layer, we
pushed responsibility for both pieces of information into the MPI layer.
This wasn't done capriciously - the modex has always included the exchange
of both pieces of information, and we chose not to disturb that situation.

However, the mixing of these two functional requirements does cause problems
when dealing with an environment such as the Cray where BTL information is
"exchanged" via an entirely different mechanism. In addition, it has been
noted that the RTE (and not the MPI layer) actually "knows" the node
location for each process.

Hence, questions have been raised as to whether:

(a) the modex should be built into a framework to allow multiple BTL
exchange mechanisms to be supported, or some alternative mechanism be used -
one suggestion made was to implement an MPICH-like attribute exchange; and

(b) the RTE should absorb responsibility for providing a "node map" of the
processes in a job (note: the modex may -use- that info, but would no longer
be required to exchange it). This has a number of implications that need to
be carefully considered - e.g., the memory required to store the node map in
every process is non-zero. On the other hand:

(i) every proc already -does- store the node for every proc - it is simply
stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
would want to avoid duplicating that storage, but there would be no change
in memory footprint if done carefully.

(ii) every daemon already knows the node map for the job, so communicating
that info to its local procs may not prove a major burden. However, the very
environments where this subject may be an issue (e.g., the Cray) do not use
our daemons, so some alternative mechanism for obtaining the info would be
required.


So the questions to be considered here are:

(a) do we leave the current modex "as-is", to include exchange of the node
map, perhaps including "#if" statements to support different exchange
mechanisms?

(b) do we separate the two functions currently in the modex and push the
requirement to obtain a node map into the RTE? If so, how do we want the MPI
layer to retrieve that info so we avoid increasing our memory footprint?

(c) do we create a separate modex framework for handling the different
exchange mechanisms for BTL info, incorporate it into an existing framework
(if so, which one - perhaps the new publish-subscribe framework?), implement
an alternative approach, or...?

(d) other suggestions?

Ralph




Re: [OMPI devel] MPI_GROUP_EMPTY and MPI_Group_free()

2007-12-04 Thread Jeff Squyres

On Dec 4, 2007, at 10:43 AM, Lisandro Dalcin wrote:


* MPI_GROUP_EMPTY cannot be freed, as it is a predefined handle. Users
then always have to check whether the result of a group operation is
MPI_GROUP_EMPTY to know whether they can free it. This is similar to
the current management of predefined datatypes.


I'd be in favor of this, since it's consistent with the rest of the  
spec w.r.t. predefined handles.


--
Jeff Squyres
Cisco Systems


Re: [OMPI devel] RTE issue I. Support for non-MPI jobs

2007-12-04 Thread Jeff Squyres

On Dec 4, 2007, at 10:11 AM, Ralph H Castain wrote:

(a) do we want to retain the feature to run non-MPI jobs with mpirun as-is
(and accept the tradeoffs, including the one described below in II)?

(b) do we provide a flag to mpirun (perhaps adding the distinction that
"orterun" must be used for non-MPI jobs?) to indicate "this is NOT an MPI
job" so we can act accordingly?


Based on talking to Ralph this morning, I'd [cautiously] be in favor  
of b) -- have an MCA param / command line switch that allows switching  
between jobs that call orte_init and those that do not, along with  
setting the default by looking at argv[0] (orterun = does not call  
orte_init, mpirun = does call orte_init).
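
A minimal sketch of that heuristic (hypothetical code and an invented flag
name - not the actual orterun source) might look like the following: default
the expectation based on the name the launcher was invoked under, and let an
explicit switch override it either way.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        /* default from argv[0]: mpirun expects orte_init, orterun does not */
        const char *name = strrchr(argv[0], '/');
        name = (name != NULL) ? name + 1 : argv[0];
        bool expects_orte_init = (strcmp(name, "orterun") != 0);

        /* an explicit override; "--non-mpi" is an invented placeholder */
        for (int i = 1; i < argc; i++) {
            if (strcmp(argv[i], "--non-mpi") == 0) {
                expects_orte_init = false;
            }
        }

        printf("launcher will %swait for procs to call orte_init\n",
               expects_orte_init ? "" : "not ");
        return 0;
    }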


The benefits are what Ralph described: less complex ORTE code and the
potential for optimizations that are difficult when you don't know whether
the launched applications are going to call MPI_INIT (orte_init) or not.


But this is definitely a change from past behavior -- so it's worth  
community discussion.  The real question is: how many OMPI users  
actually use mpirun to launch non-MPI jobs?


My $0.02 is that we're focusing ORTE on OMPI these days.  So
optimizing more for MPI startup is a Good Thing(tm).



(c) simply eliminate support for non-MPI jobs?

(d) other suggestions?



--
Jeff Squyres
Cisco Systems


Re: [OMPI devel] [ofa-general] uDAPL EVD queue length issue

2007-12-04 Thread Arlin Davis

Jon Mason wrote:

While working on OMPI udapl btl, I have noticed some "interesting"
behavior.  OFA udapl wants the evd queues to be a power of 2 and
then will subtract 1 for book keeping (ie, so that internal head and
tail pointers never touch except when the ring is empty).  OFA udapl
will report the queue length as this number (and not the original
size requested) when queried.  This becomes interesting when a power
of 2 is passed in and then queried.  For example, a requested queue
of length 256 will report a length of 255 when queried.  


Something is not right. You should ALWAYS get at least what you request. 
On my system with an mthca, a request of 256 gets you 511. It is the 
verbs provider that is rounding up, not uDAPL.


Here is my uDAPL debug output (DAPL_DBG_TYPE=0x) using dtest:

 cq_object_create: (0x519bb0,0x519d00)
dapls_ib_cq_alloc: evd 0x519bb0 cqlen=256
dapls_ib_cq_alloc: new_cq 0x519d60 cqlen=511

This is before and after the ibv_create_cq call. uDAPL builds its EVD
resources based on what is returned from this call.


I modified dtest to double check the dat_evd_query and I get the same:

8962 dto_rcv_evd created 0x519e80
8962 dto_req_evd QLEN - requested 256 and actual 511

What OFED release and device are you using?

-arlin






Re: [OMPI devel] [ofa-general] uDAPL EVD queue length issue

2007-12-04 Thread Jon Mason
On Tue, Dec 04, 2007 at 11:40:17AM -0800, Arlin Davis wrote:
> Jon Mason wrote:
>> While working on OMPI udapl btl, I have noticed some "interesting"
>> behavior.  OFA udapl wants the evd queues to be a power of 2 and
>> then will subtract 1 for book keeping (ie, so that internal head and
>> tail pointers never touch except when the ring is empty).  OFA udapl
>> will report the queue length as this number (and not the original
>> size requested) when queried.  This becomes interesting when a power
>> of 2 is passed in and then queried.  For example, a requested queue
>> of length 256 will report a length of 255 when queried.  
>
> Something is not right. You should ALWAYS get at least what you request. On 
> my system with an mthca, a request of 256 gets you 511. It is the verbs 
> provider that is rounding up, not uDAPL.
>
> Here is my uDAPL debug output (DAPL_DBG_TYPE=0x) using dtest:
>
>  cq_object_create: (0x519bb0,0x519d00)
> dapls_ib_cq_alloc: evd 0x519bb0 cqlen=256
> dapls_ib_cq_alloc: new_cq 0x519d60 cqlen=511
>
> This is before and after the ibv_create_cq call. uDAPL builds it's EVD 
> resources based on what is returned from this call.
>
> I modified dtest to double check the dat_evd_query and I get the same:
>
> 8962 dto_rcv_evd created 0x519e80
> 8962 dto_req_evd QLEN - requested 256 and actual 511
>
> What OFED release and device are you using?

I'm running OFED 1.2.5 and using Chelsio.

The behavior of iwch_create_cq in
drivers/infiniband/hw/cxgb3/iwch_provider.c is to allocate the amount
given (rounded up to a power of 2). So this function will give 256 if
256 is requested, but uDAPL will consume one of those entries for
bookkeeping and thus only have 255.
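
Just to spell out the arithmetic being discussed (illustrative C only, not
provider code): the provider sizes the CQ to a power of two, uDAPL keeps one
slot so head and tail never collide, and the usable depth is therefore the
allocation minus one. A provider that rounds 256 up past the request (as in
Arlin's mthca log) ends up reporting 511; one that allocates exactly 256
ends up reporting 255.

    #include <stdio.h>

    static unsigned round_up_pow2(unsigned n)
    {
        unsigned p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    int main(void)
    {
        unsigned requested = 256;

        /* provider that gives exactly the power of two requested */
        unsigned exact_alloc = round_up_pow2(requested);        /* 256 */
        printf("exact provider: requested %u, usable %u\n",
               requested, exact_alloc - 1);                     /* 255 */

        /* provider that rounds up past the request */
        unsigned over_alloc = round_up_pow2(requested + 1);     /* 512 */
        printf("rounding provider: requested %u, usable %u\n",
               requested, over_alloc - 1);                      /* 511 */
        return 0;
    }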

For my clarification: should the provider take into account uDAPL's
bookkeeping and round up to the next power of 2 when given a power-of-2
size? I'm probably being thick, but why doesn't uDAPL increase the
requested size by one before passing the request to the provider (or is
this the documented behavior of the function, to which the provider
should conform)?

Thanks,
Jon

>
> -arlin