Hmmm...well, I can't do either for 1.3.0 as it is departing this
The first option would be very hard to do. I would have to expose the
display-map option across the code base and check it prior to printing
anything about resolving node names. I guess I should ask: do you only
want noderesolve statements when we are displaying the map? Right now,
I will output them regardless.
The second option could be done. I could check if any "display" option
has been specified, and output the <ompi> root at that time (likewise
for the end). Anything we output in-between would be encapsulated
between the two, but that would include any user output to stdout
and/or stderr - which for 1.3.0 is not in xml.
Any thoughts?
PS. Guess I should clarify that I was not striving for true XML
interaction here, but rather a quasi-XML format that would help you to
filter the output. I have no problem trying to get to something more
formally correct, but it could be tricky in some places to achieve it
due to the inherent async nature of the beast.
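As an illustration of the kind of filtering described above, here is a minimal sketch of how a consumer might pull the noderesolve elements out of mixed output. It assumes each element arrives as a complete, self-closing tag with "name" and "resolved" attributes, and that attribute values never contain ">". The function name resolved_names and the sample text are hypothetical and not part of Open MPI.

```python
import re
import xml.etree.ElementTree as ET

# Matches one self-closing <noderesolve .../> element anywhere in the text.
# Assumption: attribute values never contain a '>' character.
NODERESOLVE = re.compile(r"<noderesolve\b[^>]*/>")


def resolved_names(output):
    """Map each hostfile name to the name it was resolved to.

    Anything that is not a noderesolve element (for example, user output
    on stdout/stderr) is ignored.
    """
    mapping = {}
    for match in NODERESOLVE.finditer(output):
        elem = ET.fromstring(match.group(0))
        name = elem.get("name")
        resolved = elem.get("resolved")
        if name is not None and resolved is not None:
            mapping[name] = resolved
    return mapping


# Hypothetical sample: quasi-XML lines interleaved with program output.
SAMPLE = """\
some program output on stdout
<noderesolve name="node0" resolved="Jarrah.local"/>
more program output
<noderesolve name="node1" resolved="Jarrah.local"/>
"""

if __name__ == "__main__":
    print(resolved_names(SAMPLE))
```

Because each element is parsed on its own, this approach does not need a single root element, which is why it tolerates interleaved non-XML output.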
On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:
The XML is looking better now, but there is still one problem. To be
valid, there needs to be only one root element, but currently you
don't have any (or many). So rather than:
<noderesolve name="node0" resolved="Jarrah.local"/>
<noderesolve name="node1" resolved="Jarrah.local"/>
<host name="Jarrah.local" slots="8" max_slots="0">
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
<process rank="3"/>
<process rank="4"/>
the XML should be:
<noderesolve name="node0" resolved="Jarrah.local"/>
<noderesolve name="node1" resolved="Jarrah.local"/>
<host name="Jarrah.local" slots="8" max_slots="0">
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
<process rank="3"/>
<process rank="4"/>
<noderesolve name="node0" resolved="Jarrah.local"/>
<noderesolve name="node1" resolved="Jarrah.local"/>
<host name="Jarrah.local" slots="8" max_slots="0">
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
<process rank="3"/>
<process rank="4"/>
Would either of these be possible?
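To make the single-root requirement concrete, here is a small sketch showing that a strict XML parser rejects several top-level elements but accepts the same text once it is wrapped in one root. The root name "ompi" is borrowed from the reply above. The closing host tag is added here so the fragment is well-formed, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Several top-level elements and no root, similar to the output shown above.
# Assumption: </host> is added so the fragment itself is well-formed.
FRAGMENT = """\
<noderesolve name="node0" resolved="Jarrah.local"/>
<noderesolve name="node1" resolved="Jarrah.local"/>
<host name="Jarrah.local" slots="8" max_slots="0">
<process rank="0"/>
<process rank="1"/>
</host>
"""


def parses(text):
    """Return True if text is a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def wrap_in_root(fragment, root="ompi"):
    """Enclose the fragment in a single root element."""
    return "<{0}>\n{1}</{0}>\n".format(root, fragment)


wrapped = wrap_in_root(FRAGMENT)
root = ET.fromstring(wrapped)
ranks = [int(p.get("rank")) for p in root.iter("process")]

if __name__ == "__main__":
    print("unwrapped parses:", parses(FRAGMENT))
    print("wrapped parses:", parses(wrapped))
    print("ranks:", ranks)
```

Wrapping on the consumer side only works if nothing else (such as non-XML user output) lands between the elements, which is the same caveat raised in the reply.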
On Dec 8, 2008, at 2:18 PM, Greg Watson wrote:
Ok thanks. I'll test from trunk in future.
On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:
Working its way around the CMR process now.
Might be easier in the future if we could test/debug this in the
trunk, though. Otherwise, the CMR procedure will fall behind and a
fix might miss a release window.
Anyway, hopefully this one will make the 1.3.0 release cutoff.
On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:
Hi Ralph,
This is now in 1.3rc2, thanks. However there are a couple of
problems. Here is what I see:
[] <noderesolve name="node0"
For some reason each line is prefixed with "[...]", any idea why
this is? Also the end tag should be "/>" not ">".
On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:
Great, thanks. I'll take a look once it comes over to 1.3.
On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:
Yo Greg
This is in the trunk as of r20032. I'll bring it over to 1.3 in
a few days.
I implemented it as another MCA param
"orte_show_resolved_nodenames" so you can actually get the info
as you execute the job, if you want. The xml tag is
"noderesolve" - let me know if you need any changes.
On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:
I guess the issue for us is that we will have to run two
commands to get the information we need. One to get the
configuration information, such as version and MCA parameters,
and one to get the host information, whereas it would seem
more logical that this should all be available via some kind
of "configuration discovery" command. I understand the issue
with supplying the hostfile though, so maybe this just points
at the need for us to separate configuration information from
the host information. In any case, we'll work with what you
think is best.
On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:
Hmmm...just to be sure we are all clear on this. The reason
we proposed to use mpirun is that "hostfile" has no meaning
outside of mpirun. That's why ompi_info can't do anything in
this regard.
We have no idea what hostfile the user may specify until we
actually get the mpirun cmd line. They may have specified a
default-hostfile, but they could also specify hostfiles for
the individual app_contexts. These may or may not include the
node upon which mpirun is executing.
So the only way to provide you with a separate command to get
a hostfile<->nodename mapping would require you to provide us
with the default-hostfile and/or hostfile cmd line options
just as if you were issuing the mpirun cmd. We just wouldn't
launch - but it would be the exact equivalent of doing
"mpirun --do-not-launch".
Am I missing something? If so, please do correct me - I would
be happy to provide a tool if that would make it easier. Just
not sure what that tool would do.
On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
It seems a little strange to be using mpirun for this, but
barring providing a separate command, or using ompi_info, I
think this would solve our problem.
On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
Sorry for delay - had to ponder this one for a while.
Jeff and I agree that adding something to ompi_info would
not be a good idea. Ompi_info has no knowledge or
understanding of hostfiles, and adding that capability to
it would be a major distortion of its intended use.
However, we think we can offer an alternative that might
better solve the problem. Remember, we now treat hostfiles
in a very different manner than before - see the wiki page
for a complete description, or "man orte_hosts".
So the problem is that, to provide you with what you want,
we need to "dump" the information from whatever default-
hostfile was provided, and, if no default-hostfile was
provided, then the information from each hostfile that was
provided with an app_context.
The best way we could think of to do this is to add another
mpirun cmd line option --dump-hostfiles that would output
the line-by-line name from the hostfile plus the name we
resolved it to. Of course, --xml would cause it to be in
xml format.
Would that meet your needs?
On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
Hi Ralph,
We've been discussing this back and forth a bit internally
and don't really see an easy solution. Our problem is that
Eclipse is not running on the head node, so gethostbyname
will not necessarily resolve to the same address. For
example, the hostfile might refer to the head node by an
internal network address that is not visible to the
outside world. Since gethostbyname also looks in /etc/hosts,
it may resolve locally but not on a remote system. The
only thing I can think of would be, rather than us reading
the hostfile directly as we do now, to provide an option
to ompi_info that would dump the hostfile using the same
rules that you apply when you're using the hostfile. Would
that be feasible?
On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
Sorry for delay - was on vacation and am now trying to
work my way back to the surface.
I'm not sure I can fix this one for two reasons:
1. In general, OMPI doesn't really care what name is used
for the node. However, the problem is that it needs to be
consistent. In this case, ORTE has already used the name
returned by gethostname to create its session directory
structure long before mpirun reads a hostfile. This is
why we retain the value from gethostname instead of
allowing it to be overwritten by the name in whatever
allocation we are given. Using the name in hostfile would
require that I either find some way to remember any prior
name, or that I tear down and rebuild the session
directory tree - neither seems attractive nor simple
(e.g., what happens when the user provides multiple
entries in the hostfile for the node, each with a
different IP address based on another interface in that
node? Sounds crazy, but we have already seen it done -
which one do I use?).
2. We don't actually store the hostfile info anywhere -
we just use it and forget it. For us to add an XML
attribute containing any hostfile-related info would
therefore require us to re-read the hostfile. I could
have it do that -only- in the case of "XML output
required", but it seems rather ugly.
An alternative might be for you to simply do a
"gethostbyname" lookup of the IP address or hostname to
see if it matches instead of just doing a strcmp. This is
what we have to do internally as we frequently have
problems with FQDN vs. non-FQDN vs. IP addresses etc. If
the local OS hasn't cached the IP address for the node in
question it can take a little time to DNS resolve it, but
otherwise works fine.
I can point you to the code in OPAL that we use - I would
think something similar would be easy to implement in
your code and would readily solve the problem.
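A rough sketch of the lookup-instead-of-strcmp idea, for illustration only. This is not the OPAL code mentioned above. It assumes two names refer to the same node if they resolve to at least one common IPv4 address, and it takes the resolver as a parameter (defaulting to the standard socket.gethostbyname_ex) so the comparison can be exercised without DNS. The function names are hypothetical.

```python
import socket


def addresses(host, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses a name or address resolves to.

    An unresolvable name yields an empty set instead of raising.
    """
    try:
        _name, _aliases, addrs = resolver(host)
    except OSError:  # covers socket.gaierror and socket.herror
        return set()
    return set(addrs)


def same_host(a, b, resolver=socket.gethostbyname_ex):
    """Compare two host names or IP addresses by what they resolve to."""
    if a == b:
        # Identical strings match without any lookup (the strcmp case).
        return True
    return bool(addresses(a, resolver) & addresses(b, resolver))


if __name__ == "__main__":
    # Usually true on a typical system, but depends on local configuration.
    print(same_host("localhost", "127.0.0.1"))
```

The result depends on where the lookup runs: a name that resolves on the head node may not resolve, or may resolve differently, on another machine.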
On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
The problem we're seeing is just with the head node. If
I specify a particular IP address for the head node in
the hostfile, it gets changed to the FQDN when displayed
in the map. This is a problem for us as we need to be
able to match the two, and since we're not necessarily
running on the head node, we can't always do the same
resolution you're doing.
Would it be possible to use the same address that is
specified in the hostfile, or alternatively provide an
XML attribute that contains this information?
On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
Not in that regard, depending upon what you mean by
"recently". The only changes I am aware of wrt nodes
consisted of some changes to the order in which we use
the nodes when specified by hostfile or -host, and a
little #if protectionism needed by Brian for the Cray
Are you seeing this for every node? Reason I ask: I
can't offhand think of anything in the code base that
would replace a host name with the FQDN because we
don't get that info for remote nodes. The only
exception is the head node (where mpirun sits) - in
that lone case, we default to the name returned to us
by gethostname(). We do that because the head node is
frequently accessible on a more global basis than the
compute nodes - thus, the FQDN is required to ensure
that there is no address confusion on the network.
If the user refers to compute nodes in a hostfile or
-host (or in an allocation from a resource manager) by
non-FQDN, we just assume they know what they are doing
and the name will correctly resolve to a unique address.
On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
Has there been a change in the behavior of the
-display-map option recently in the 1.3
branch? We're now seeing the host name as a fully
resolved DN rather than the entry that was specified
in the hostfile. Is there any particular reason for
this? If so, would it be possible to add the hostfile
entry to the output since we need to be able to match
the two?
devel mailing list