Gotcha; thanks for the explanation.

The capabilities you added sound good for the moment; I'm sure we'll think of more over time...


On Jan 22, 2008, at 10:19 AM, Ralph H Castain wrote:




On 1/19/08 6:31 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

Ralph --

I'm a little confused as to what you're providing. In all 3 of the scenarios you describe, you're saying that the external tool can connect to the HNP for one or more jobs and execute a few discrete functions:

- finding procs and/or jobs running under that HNP
- querying the status of procs and/or jobs
- querying the status of nodes
- spawning a new job
- terminating a job


Actually, that isn't quite correct - sorry for the confusion. What I was trying to say was that you could connect via a simple wire protocol (scenario #1) to pass a few discrete commands and queries to an existing mpirun (and/or persistent virtual machine). The HNP "listens" on the same daemon command socket that it always opens, so there is no "new" socket for this functionality.

The advantages of this approach are: (a) the tool only calls simple library functions to pass commands/queries to the HNP and get answers back, so any changes in APIs within ORTE are totally hidden from the tool; (b) the size of the required comm library is much smaller than all of ORTE, so the tool gets a smaller memory footprint; and (c) the tool "lives" totally independently of the application, so you can quit (and later restart and reconnect) the tool without disturbing the application.

Disadvantages are: (a) you only get access to a limited set of queries and/or commands - which is why I was requesting input on commands people would like that I might have missed; and (b) the mpirun and/or virtual machine must be started separately before the tool can connect to them (however, the tool can be started first and simply be told to "look for an mpirun" after the mpirun is issued).
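Just to make the flow concrete: the actual function and type names live on the tmp branch and may well differ, so every identifier below (tool_comm_init, tool_comm_find_hnps, tool_comm_query_job, and the structs) is a hypothetical placeholder. A scenario #1 tool would look roughly like this:

#include <stdio.h>
#include "tool_comm.h"   /* hypothetical header for the small comm library */

int main(int argc, char **argv)
{
    tool_hnp_t *hnps = NULL;            /* hypothetical handle, one per mpirun */
    int num_hnps = 0, i;

    /* initialize just the comm library - no full ORTE/OMPI init */
    if (0 != tool_comm_init()) {
        fprintf(stderr, "could not init comm library\n");
        return 1;
    }

    /* "look for an mpirun": discover all mpiruns started from this host */
    tool_comm_find_hnps(&hnps, &num_hnps);

    for (i = 0; i < num_hnps; i++) {
        tool_job_status_t status;       /* job map plus all proc states */
        tool_comm_query_job(&hnps[i], &status);
        printf("mpirun %d: %d procs, job state %d\n",
               i, status.num_procs, status.state);
    }

    /* disconnect; the application keeps running undisturbed */
    tool_comm_finalize();
    return 0;
}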



Scenario #2 is identical to what we have in the code releases today. In this mode, the tool calls "orte_init" and sets itself up as an HNP. It then uses the ORTE APIs to execute the commands - e.g., calling orte_plm.spawn to launch the specified application. The tool can also launch any distributed "probes" (i.e., processes needed by the tool but not part of the application - e.g., to monitor an application's resource usage) on the backend nodes, if desired, by simply calling orte_plm.spawn for a second "app" that consists of the probe executable.

Advantages: full access to all ORTE functionality and internal data

Disadvantages: (a) the tool's code may have to be updated to follow changes in ORTE internal APIs; (b) the tool must stay alive throughout execution of the application.
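For comparison, a scenario #2 tool might look something like the sketch below. This is only a rough illustration pieced together from the public ORTE sources - the orte_init arguments, the header paths, and the exact job/app-context fields vary across versions and should be treated as assumptions, not as the branch's actual code:

#include <stdlib.h>
#include <string.h>
#include "orte/runtime/runtime.h"        /* orte_init / orte_finalize */
#include "orte/runtime/orte_globals.h"   /* orte_job_t, orte_app_context_t */
#include "orte/mca/plm/plm.h"            /* orte_plm.spawn */

int main(int argc, char **argv)
{
    orte_job_t *jdata;
    orte_app_context_t *app, *probe;

    /* the tool itself becomes the HNP (exact signature/flag is an assumption) */
    orte_init(&argc, &argv, ORTE_PROC_HNP);

    jdata = OBJ_NEW(orte_job_t);

    /* app_context #1: the user's application */
    app = OBJ_NEW(orte_app_context_t);
    app->app = strdup("./user_app");
    app->num_procs = 4;
    opal_pointer_array_add(jdata->apps, app);

    /* app_context #2: the tool's distributed "probes" */
    probe = OBJ_NEW(orte_app_context_t);
    probe->app = strdup("./my_probe");
    probe->num_procs = 4;
    opal_pointer_array_add(jdata->apps, probe);
    jdata->num_apps = 2;

    /* hand the whole thing to the PLM for launch */
    orte_plm.spawn(jdata);

    /* ... monitor via ORTE internals, wait for completion ... */

    orte_finalize();
    return 0;
}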



Scenario #3 is something of a combination of the prior two. If you invoke mpirun to launch an application into the background, you can subsequently invoke mpirun again to launch a set of distributed "probes" (as described above) to monitor that application. In this case, you could (if desired) have one or more of the "probe" processes contact the HNP via the simple wire protocol to issue commands. Or you could just have the processes report (via stdout or files) whatever info they are monitoring.

The point in this scenario was mainly to show that you could launch a distributed tool without dealing with the ORTE interfaces - the tool's procs can either just do their own thing, or can use the wire protocol to communicate with the application's HNP. In this case, the tool is again independent of the application being monitored, so you could stop and restart/reconnect it without affecting anything.
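As a trivial illustration of the "just report via stdout" style - nothing here is specific to the branch, it is plain POSIX C, and what a real probe measures is entirely up to the tool author:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A minimal "probe": launched with mpirun alongside the application, it
 * periodically samples the node's load average and prints it; mpirun's
 * normal IO forwarding delivers the output back to the user. */
int main(void)
{
    char host[256];
    double load[3];

    gethostname(host, sizeof(host));

    for (;;) {
        if (-1 != getloadavg(load, 3)) {
            printf("probe on %s: load %.2f %.2f %.2f\n",
                   host, load[0], load[1], load[2]);
            fflush(stdout);
        }
        sleep(5);
    }
    return 0;  /* not reached; the probe job is terminated separately */
}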



These were just a response to some concerns expressed about tools dealing with changing APIs. The wire protocol removes that necessity/annoyance, with some (hopefully minor) limits on capability. What people had wanted from a tool was the ability to spawn jobs, spawn distributed "probes", and query the status of jobs/nodes/procs. I have provided that capability - I'm just not sure if there is more they would like to see.

Hope that helps
Ralph

I can see how this maps into scenario #1, but I don't quite understand scenarios #2 and #3. Is there a new API for this functionality, or is there a simple wire protocol that is used to connect to the HNP and send these commands? Does the HNP listen on a new socket for these commands? I.e., how does it work?


On Jan 16, 2008, at 8:47 AM, Ralph Castain wrote:

Hello all

Summary: this note provides a brief overview of how various tools can interface to OMPI applications once the next version of ORTE is integrated into the trunk. It includes a request for input regarding any needs (e.g., additional commands to be supported in the interface) that have not been adequately addressed.

As many of you know, I have been working on a tmp branch to complete the revamp of ORTE that has been in progress for quite some time. Among other things, this revamp is intended to simplify the system, provide enhanced scalability, and improve reliability.

As part of that effort, I have extensively revised the support for external tools. In the past, tools such as the Eclipse PTP could only interact with Open MPI-based applications via ORTE APIs, thus exposing the tool to any changes in those APIs. Most tools, however, do not require the level of control provided by the APIs and can benefit from a simplified interface.

Accordingly, the revamped ORTE now offers alternative methods of interaction. The primary change has been the creation of a communications library with a simple serial protocol for interacting with OMPI jobs. Thus, tools now have three choices:

1. I have created a new communications library that tools can link against. It does not include all of the ORTE or OMPI libraries, so it has a very small memory footprint. Besides the usual calls to initialize and finalize, the library contains utilities for:

- finding all of the OMPI jobs running on that HNP (i.e., all OMPI jobs whose mpirun was executed from that host)
- querying the status of a job (provides the job map plus all proc states)
- querying the status of nodes (provides node names, status, and a list of procs on each node, including their state)
- querying the status of a specific process
- spawning a new job
- terminating a job

In addition, you can attach to the output streams of any process, specifying stdout, stderr, or both - this "tees" the specified streams, so it won't interfere with the job's normal output flow.

I could also create a utility to allow attachment to the input stream of a process. However, I'm a little concerned about possible conflicts with whatever is already flowing across that stream. I would appreciate any suggestions as to whether or not to provide that capability.

Note: we removed the concept of the ORTE "universe", so a tool can now talk to any mpirun without complications. Thus, tools can simultaneously "connect" to and monitor multiple mpiruns, if desired.


2. Link against all of OMPI or ORTE and execute as a standalone program. In this mode, your tool would act as a surrogate for mpirun by directly spawning the user's application. This provides some flexibility, but it does mean that both the tool and the job -must- end together, and that the tool may need to be revised whenever OMPI/ORTE APIs are updated.


3. Link against all of OMPI or ORTE, executing as a distributed set of processes. In this mode, you would execute your tool via "mpirun -pernode ./my_tool" (or whatever command is appropriate - this example would launch one tool process on every node in the allocation). If the tool processes need to communicate with each other, they can call MPI_Init or orte_init, depending upon the level of desired communication. Note that the tool job will be completely standalone from the application job and must be terminated separately.
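For example, a per-node tool in mode #3 that uses MPI_Init so its processes can talk to each other might look like this - plain MPI, nothing ORTE-specific, and the report content is just a placeholder:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len, i;
    char name[MPI_MAX_PROCESSOR_NAME];
    char report[MPI_MAX_PROCESSOR_NAME + 64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* each tool process builds a one-line status report for its node */
    snprintf(report, sizeof(report), "tool rank %d on node %s: ok", rank, name);

    if (0 == rank) {
        /* rank 0 gathers and prints every node's report */
        printf("%s\n", report);
        for (i = 1; i < size; i++) {
            char buf[sizeof(report)];
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, i, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", buf);
        }
    } else {
        MPI_Send(report, sizeof(report), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}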


In all of these cases, it is possible for tool processes to connect (via MPI and/or ORTE-RML) to a job's processes provided that the application supports it.
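One standard way an application can "support" such a connection is the MPI-2 name-publishing handshake. The tool-side sketch below assumes the application has already called MPI_Open_port / MPI_Publish_name / MPI_Comm_accept under a service name ("app-monitor" here, made up for the example) and that the runtime's name service is visible to both jobs:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm app_comm;

    MPI_Init(&argc, &argv);

    /* find the port the application published, then connect to it */
    MPI_Lookup_name("app-monitor", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &app_comm);

    /* ... exchange monitoring messages with the application over app_comm ... */

    MPI_Comm_disconnect(&app_comm);
    MPI_Finalize();
    return 0;
}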

I can provide more details, of course, to anyone wishing them. What I would appreciate, though, is any feedback about desired commands, modes of operation, etc. that I might have missed or that people would prefer be changed. This code is all in a private repository for my tmp branch, but I expect that to merge with the trunk fairly soon. I have provided a couple of example tools in that code to illustrate the above modes of operation.

Thanks
Ralph










--
Jeff Squyres
Cisco Systems
