Re: [OMPI devel] ORTE process name,, nodeid..

Ralph Castain Mon, 19 Nov 2007 22:32:29 -0500

On 11/19/07 6:20 PM, "Tim Prins" <tpr...@cs.indiana.edu> wrote:

> On Monday 19 November 2007 09:42:21 am Ralph H Castain wrote:
> <snip>
>> An alternative solution might be to incorporate the modex in the new OMPI
>> framework I was about to create anyway. This framework was intended to deal
>> with publish/lookup of OMPI data to support a variety of methods.
>> Originally, we had intended only to include support there for things
>> specifically related to MPI_Publish etc., but there is no reason we
>> couldn't generalize it to support the general exchange of process "how to
>> connect to me" info and include a modex API in it. I was figuring we would
>> need two immediate components in it anyway: an ORTE one for when we have
>> full ORTE support in the system, and a CNOS one that would...well, I guess
>> just bark and say "you can't do publish/lookup on a Cray". It would be
>> simple to add the modex stuff there, and makes some logical sense as well.
> I think this approach is fundamentally flawed. Our frameworks are designed to
> abstract out something, to allow for multiple implementations. However, doing
> this would put two completely different things (the modex and the MPI
> pub/sub) together in one framework. While this may be convenient for the
> Cray, it would be very inconvenient for someone who wanted to do the MPI
> pub/sub via an LDAP server (as has been discussed). The key here is that MPI
> pub/sub is for very small amounts of data, accessed infrequently and in a
> non-performance-critical manner, whereas the modex is for rather large
> amounts of information (on big jobs) that has to be exchanged efficiently.

Actually, several people talked about this before we proposed it and came to
a different conclusion. The modex is in essence a "here's how to talk to me"
communication, which is the same intent as publish/lookup. I agree that the
volume of data involved is different. However, we are -not- proposing to use
the same mechanism for the two (modex vs. pub/lookup).

The proposal was based on the fact that the publish/lookup and modex
effectively use similar mechanisms - i.e., the orte component would use the
RML as the underlying communication mechanism. In contrast, the Cray
component has alternative, non-RML-based mechanisms for both.

Things like the LDAP server pose an interesting challenge. In that case, the
publish/lookup cannot use the RML as LDAP has no understanding of that comm
mode. The modex, however, might - or might not - use the RML.
Accordingly, the plan was to provide base functions that use RML for any
component that can and wants to do so. This is identical to the approach we
use throughout the code base.

However, we do need the modex in a framework somewhere as we will need to
modify it to support tight integration with various environments. I cannot
see doing every tight integration with yet another RSL component as the code
duplication gets absurd - there isn't enough difference to support it. I
also, though, don't want to be forced to use the same modex in every case if
the native environment can provide an alternative method - having the modex
in the framework solves that problem.
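
As an illustration only (hypothetical names throughout; this is not code from
ORTE or from the proposed framework), the arrangement described above could be
sketched as a module struct with a modex entry point. A component that can use
the RML points at a shared base function, while a component with a native
alternative supplies its own:

```c
#include <stdio.h>

/* Hypothetical modex entry point: writes a description of how the
 * exchange was done into 'out' and returns 0 on success. */
typedef int (*modex_fn_t)(char *out, size_t len);

/* Hypothetical framework module: one per component. */
typedef struct {
    const char *name;
    modex_fn_t  modex;
} pubsub_module_t;

/* Base function available to any component that can and wants to use
 * the RML. Here it only records which path was taken. */
static int base_modex_rml(char *out, size_t len) {
    snprintf(out, len, "exchanged via RML");
    return 0;
}

/* Stand-in for a native exchange that bypasses the RML entirely. */
static int cnos_modex_native(char *out, size_t len) {
    snprintf(out, len, "queried from native runtime");
    return 0;
}

/* The ORTE component reuses the base function; the CNOS one does not. */
static const pubsub_module_t orte_module = { "orte", base_modex_rml };
static const pubsub_module_t cnos_module = { "cnos", cnos_modex_native };

/* Hypothetical selection step: pick a component for this environment. */
static const pubsub_module_t *select_module(int have_full_orte) {
    return have_full_orte ? &orte_module : &cnos_module;
}
```

Callers would only ever go through select_module(...)->modex(...), so a
tightly integrated environment can swap in its own exchange without another
copy of the surrounding code.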

So I guess I don't grok the issue here - what is wrong with having a modex
API in the pub/sub framework??? Other than causing you some additional merge
issues within RSL, I fail to understand why this is a problem.


> 
> Before anyone misunderstands, I am *not* proposing that we add a modex
> framework to ompi. Rather, I think this is a case where the RSL could make
> things really easy.
> 
> The RSL defines a process attribute system. One of the original ideas (later
> retracted, but now that I think about it I may re-add it) was to have some
> predefined attribute keys, that the runtime would set so we could look up
> information about any process.
> 
> So in the case of the Cray, the rsl_init function would query to get all the
> info it wants, and then populate the info into its (local) process attribute
> data store.
> 
> In other systems each process would set the information in rsl_init and it
> would be exchanged in the normal modex method.
> 
> Then, the information would be looked up (locally) using the 'get' function on
> both platforms.
> 
> Simple, eh?
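
As an illustration only (hypothetical names; this is not the RSL API), the
attribute idea quoted above might look roughly like the following: an init
step fills a local store with per-process attributes, and every later lookup
is a purely local 'get'. The fixed table stands in for whatever a single
native query would return.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ATTRS 16

/* Hypothetical local attribute store: one slot per (process, key) pair. */
typedef struct {
    int  vpid;
    char key[32];
    char value[32];
} attr_t;

static attr_t store[MAX_ATTRS];
static int    n_attrs = 0;

/* Record an attribute locally; returns 0 on success, -1 if the store is full. */
static int attr_set(int vpid, const char *key, const char *value) {
    if (n_attrs >= MAX_ATTRS) return -1;
    store[n_attrs].vpid = vpid;
    snprintf(store[n_attrs].key, sizeof store[n_attrs].key, "%s", key);
    snprintf(store[n_attrs].value, sizeof store[n_attrs].value, "%s", value);
    n_attrs++;
    return 0;
}

/* Purely local lookup; returns NULL when the attribute is unknown. */
static const char *attr_get(int vpid, const char *key) {
    for (int i = 0; i < n_attrs; i++) {
        if (store[i].vpid == vpid && strcmp(store[i].key, key) == 0)
            return store[i].value;
    }
    return NULL;
}

/* Stand-in for a Cray-style init: populate every peer's node id up front
 * from what would be one runtime query (here, a fixed table). */
static void init_from_native_query(void) {
    static const char *nodeids[] = { "10", "10", "11" };
    for (int vpid = 0; vpid < 3; vpid++)
        attr_set(vpid, "nodeid", nodeids[vpid]);
}
```

On other systems the init step would instead call attr_set for the local
process only and let the normal modex fill in the peers; the attr_get side
would be the same on both.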

Maybe - and maybe not. The devil is always in the details. My concerns with
the RSL have been documented and wildly misunderstood. I still fail to see
the overall advantage as it seems we get different explanations every time
we ask. But I'll set that aside here.

FWIW: The publish/lookup interface was specifically required to support both
local and remote data storage operations, though that doesn't really apply
to the modex.

> 
> As an alternative to this, I think we could apply these same ideas into a
> specialized ORTE system, but it would not be as clean, and would tie our
> system closer to ORTE. I am not going to argue whether this is good or bad,
> but I am just mentioning it as a consequence.

My concern right now is that doing it in RSL means (as we chatted about
offline) integrating RSL into the OMPI trunk NOW - either directly or as
part of the ORTE revision branch. This will certainly delay getting the ORTE
revision done, perhaps by 3 months or more (IMHO). I will contact
LANL management to seek their input on this matter, but I doubt they will be
supportive as such a delay will cause LANL to miss several critical
RoadRunner milestones - which would almost certainly negatively impact our
RoadRunner commercial partners as well.

Alternatively, I suppose we could just fork the code base at this time, and
I'll complete the orte revisions on a LANL server. I hate to do this,
though, as it means someone (LANL, IBM, Voltaire, some combination, or
whoever) will be left with the problem of dealing with either re-merging
the branches or supporting a split code. I only offer it as an option we
could consider, if necessary.

Given those potential consequences, it would really help to have some
substantive reason -why- the framework is unacceptable. I grok that you feel
the RSL offers a possibly better alternative, but why does that mean we
shouldn't do the framework now and worry about that if/when the RSL is
proposed for production?

> 
> Tim
> 
>> 
>> If that makes sense, we can implement the latter approach on the branch
>> where we are doing the next major ORTE revision - that's where I was going
>> to create the new framework anyway.
>> 
>> Ralph
>> 
>> On 11/16/07 10:24 PM, "Shipman, Galen M." <gship...@ornl.gov> wrote:
>>> I am doing some work on Cray's CNL to support shared memory. To support
>>> shared memory I need to know if processes are local or remote. For other
>>> systems we simply use the modex in ompi_proc_get_info to get the proc's
>>> nodeid. When using CNOS I don't need the modex to get a remote process's
>>> nodeid. In fact, I can obtain every process's pid and nodeid (nid/pid)
>>> via a single CNOS call.
>>> 
>>> I have explored a couple of ways to populate the proc structures on the
>>> Cray. One involves using #if's to do special things in
>>> ompi_proc_get_info. I don't like this. The second method involves adding
>>> a CNOS nameserver and modifying the orte_process_name_t to include the
>>> orte_nodeid_t so that the nameserver can populate all the info if it can.
>>> Prior to this change, the orte_nodeid_t was in ompi_proc_t, which doesn't
>>> make any sense to me: it is an ORTE-level concept, yet it is only
>>> accessible on the OMPI side. I also don't like adding orte_nodeid_t to
>>> orte_process_name_t, as it really doesn't have anything to do with a
>>> name. I think it makes more sense to have an orte_proc_t, something
>>> like the following structure:
>>> 
>>> 
>>> 
>>> struct orte_process_name_t {
>>>     orte_jobid_t jobid;     /**< Job number */
>>>     orte_vpid_t vpid;       /**< Process number */
>>> };
>>> 
>>> struct orte_proc_t {
>>>     opal_list_item_t super;
>>>     orte_process_name_t proc_name;
>>>     orte_nodeid_t nid;      /**< "nodeid" on which the proc resides */
>>> };
>>> 
>>> struct ompi_proc_t {
>>>     orte_proc_t base;
>>>     ..... Etc .....
>>> };
>>> 
>>> 
>>> I know there is some talk about removing the process names; I'm not sure
>>> how that fits in here but this is what makes sense to me given the
>>> current architecture. Any thoughts here?
>>> 
>>> 
>>> - Galen
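
As an illustration only, here is how the proposed layout would serve the
shared-memory use case described above. The typedefs are simplified stand-ins
(the real ORTE definitions differ, and the opal_list_item_t member is
omitted), and proc_is_local is a hypothetical helper, not an existing ORTE
function:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the ORTE types; assumptions for this sketch. */
typedef uint32_t orte_jobid_t;
typedef uint32_t orte_vpid_t;
typedef uint32_t orte_nodeid_t;

/* The name carries identity only: job and process number. */
typedef struct {
    orte_jobid_t jobid;
    orte_vpid_t  vpid;
} orte_process_name_t;

/* The proc carries the name plus the node it resides on. */
typedef struct {
    orte_process_name_t proc_name;
    orte_nodeid_t       nid;
} orte_proc_t;

/* Hypothetical helper: two procs can share memory only if they are on
 * the same node, which is a plain comparison once nid lives in the proc. */
static bool proc_is_local(const orte_proc_t *me, const orte_proc_t *peer) {
    return me->nid == peer->nid;
}
```

With this split, whichever mechanism fills in nid (the modex or a single
CNOS call) is invisible to the code that asks whether a peer is local.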
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel