Ralph,
Is there something I need to do to enable stdout/err encapsulation
(apart from -xml)? Here's what I see:
$ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI
<map>
<host name="Jarrah.local" slots="8" max_slots="0">
<noderesolve resolved="node0"/>
<noderesolve resolved="node1"/>
<noderesolve resolved="node2"/>
<noderesolve resolved="node3"/>
<noderesolve resolved="node4"/>
<noderesolve resolved="node5"/>
<noderesolve resolved="node6"/>
<noderesolve resolved="node7"/>
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
<process rank="3"/>
<process rank="4"/>
</host>
</map>
n = 0
n = 0
n = 0
n = 0
n = 0
On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote:
Okay, it is in the trunk as of r20284 - I'll file the request to
have it moved to 1.3.1.
Let me know if you get a chance to test the stdout/err stuff in the
trunk - we should try and iterate it so any changes can make 1.3.1
as well.
Thanks!
Ralph
On Jan 15, 2009, at 11:03 AM, Greg Watson wrote:
Ralph,
I think the second form would be ideal and would simplify things
greatly.
Greg
On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:
Here is what I was able to do - note that the resolve messages are
associated with the specific hostname, not the overall map:
<map>
<host name="graywolf54.lanl.gov" slots="1" max_slots="0">
<noderesolve name="graywolf54.lanl.gov" resolved="localhost"/>
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
</host>
</map>
Will that work for you? If you like, I can remove the name= field
from the noderesolve element since the info is specific to the
host element that contains it. In other words, I can make it look
like this:
<map>
<host name="graywolf54.lanl.gov" slots="1" max_slots="0">
<noderesolve resolved="localhost"/>
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
</host>
</map>
if that would help.
Ralph
On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:
We -may- be able to do a more formal XML output at some point.
The problem will be the natural interleaving of stdout/err from
the various procs due to the async behavior of MPI. Mpirun
receives fragmented output in the forwarding system, limited by
the buffer sizes and the amount of data we can read at any one
"bite" from the pipes connecting us to the procs. So even though
the user -thinks- they output a single large line of stuff, it
may show up at mpirun as a series of fragments. Hence, it gets
tricky to know how to put appropriate XML brackets around it.
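To illustrate, here is a minimal sketch of the kind of buffering
that per-line XML bracketing would need - purely hypothetical, not
the actual ORTE IOF code; the <stdout> element and both function
names are made up:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* hypothetical helper: wrap one complete line in an XML element */
static void emit_xml_line(int rank, const char *line, size_t len)
{
    printf("<stdout rank=\"%d\">%.*s</stdout>\n", rank, (int)len, line);
}

/* read fragments from the pipe to one proc; emit complete lines only */
static void forward_output(int fd, int rank)
{
    char buf[4096];
    size_t used = 0;
    ssize_t n;

    while ((n = read(fd, buf + used, sizeof(buf) - used)) > 0) {
        used += (size_t)n;
        char *nl;
        /* flush each complete line; keep any trailing fragment */
        while ((nl = memchr(buf, '\n', used)) != NULL) {
            size_t len = (size_t)(nl - buf);
            emit_xml_line(rank, buf, len);
            memmove(buf, nl + 1, used - len - 1);
            used -= len + 1;
        }
        if (used == sizeof(buf)) {          /* line exceeds the buffer - */
            emit_xml_line(rank, buf, used); /* forced to fragment it     */
            used = 0;
        }
    }
}

Even that only works per-pipe: lines from different procs can still
interleave between reads, which is the async problem described above.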
Given this input about when you actually want resolved name info,
I can at least do something about that area. Won't be in 1.3.0,
but should make 1.3.1.
As for XML-tagged stdout/err: the OMPI community asked me not to
turn that feature "on" for 1.3.0 as they felt it hasn't been
adequately tested yet. The code is present, but cannot be
activated in 1.3.0. However, I believe it is activated on the
trunk when you do --xml --tagged-output, so perhaps some testing
will help us debug and validate it adequately for 1.3.1?
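For example, something like this against a trunk build (using the
flag spellings above - they may differ in the actual trunk):

$ mpirun --xml --tagged-output -np 2 ./testMPI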
Thanks
Ralph
On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:
Ralph,
The only time we use the resolved names is when we get a map, so
we consider them part of the map output.
If quasi-XML is all that will ever be possible with 1.3, then
you may as well leave it as-is and we will attempt to clean it up
in Eclipse. It would be nice if a future version of ompi could
output correct XML (including stdout) as this would vastly
simplify the parsing we need to do.
Regards,
Greg
On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:
Hmmm...well, I can't do either for 1.3.0 as it is departing
this afternoon.
The first option would be very hard to do. I would have to
expose the display-map option across the code base and check it
prior to printing anything about resolving node names. I guess
I should ask: do you only want noderesolve statements when we
are displaying the map? Right now, I will output them regardless.
The second option could be done. I could check if any "display"
option has been specified, and output the <ompi> root at that
time (likewise for the end). Anything we output in-between
would be encapsulated between the two, but that would include
any user output to stdout and/or stderr - which for 1.3.0 is
not in xml.
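In pseudo-C, that check would amount to something like this (a
sketch only - the variable and function names here are illustrative,
not the real ORTE globals):

#include <stdio.h>

static int xml_root_open = 0;

/* emit the root element once if xml output is on and any "display"
 * option was requested */
static void maybe_open_xml_root(int xml_output, int any_display_opt)
{
    if (xml_output && any_display_opt && !xml_root_open) {
        printf("<ompi>\n");
        xml_root_open = 1;
    }
}

/* close the root element at shutdown, after everything in between
 * (including the procs' stdout/stderr) has been forwarded */
static void maybe_close_xml_root(void)
{
    if (xml_root_open) {
        printf("</ompi>\n");
        xml_root_open = 0;
    }
}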
Any thoughts?
Ralph
PS. Guess I should clarify that I was not striving for true XML
interaction here, but rather a quasi-XML format that would help
you to filter the output. I have no problem trying to get to
something more formally correct, but it could be tricky in some
places to achieve it due to the inherent async nature of the
beast.
On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:
Ralph,
The XML is looking better now, but there is still one problem.
To be valid, there must be exactly one root element, but currently
the output has multiple top-level elements instead of a single
root. So rather than:
<noderesolve name="node0" resolved="Jarrah.local"/>
<noderesolve name="node1" resolved="Jarrah.local"/>
<map>
<host name="Jarrah.local" slots="8" max_slots="0">
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
<process rank="3"/>
<process rank="4"/>
</host>
</map>
the XML should be:
<map>
<noderesolve name="node0" resolved="Jarrah.local"/>
<noderesolve name="node1" resolved="Jarrah.local"/>
<host name="Jarrah.local" slots="8" max_slots="0">
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
<process rank="3"/>
<process rank="4"/>
</host>
</map>
or:
<ompi>
<noderesolve name="node0" resolved="Jarrah.local"/>
<noderesolve name="node1" resolved="Jarrah.local"/>
<map>
<host name="Jarrah.local" slots="8" max_slots="0">
<process rank="0"/>
<process rank="1"/>
<process rank="2"/>
<process rank="3"/>
<process rank="4"/>
</host>
</map>
</ompi>
Would either of these be possible?
Thanks,
Greg
On Dec 8, 2008, at 2:18 PM, Greg Watson wrote:
Ok thanks. I'll test from trunk in future.
Greg
On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:
Working its way around the CMR process now.
Might be easier in the future if we could test/debug this in
the trunk, though. Otherwise, the CMR procedure will fall
behind and a fix might miss a release window.
Anyway, hopefully this one will make the 1.3.0 release cutoff.
Thanks
Ralph
On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:
Hi Ralph,
This is now in 1.3rc2, thanks. However, there are a couple
of problems. Here is what I see:
[Jarrah.watson.ibm.com:58957] <noderesolve name="node0"
resolved="Jarrah.watson.ibm.com">
For some reason each line is prefixed with "[...]" - any
idea why this is? Also, the element should be self-closing:
"/>" rather than ">".
Thanks,
Greg
On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:
Great, thanks. I'll take a look once it comes over to 1.3.
Cheers,
Greg
On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:
Yo Greg
This is in the trunk as of r20032. I'll bring it over to
1.3 in a few days.
I implemented it as another MCA param
"orte_show_resolved_nodenames" so you can actually get
the info as you execute the job, if you want. The xml tag
is "noderesolve" - let me know if you need any changes.
Ralph
On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:
Ralph,
I guess the issue for us is that we will have to run two
commands to get the information we need. One to get the
configuration information, such as version and MCA
parameters, and one to get the host information, whereas
it would seem more logical that this should all be
available via some kind of "configuration discovery"
command. I understand the issue with supplying the
hostfile though, so maybe this just points at the need
for us to separate configuration information from the
host information. In any case, we'll work with what you
think is best.
Greg
On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:
Hmmm...just to be sure we are all clear on this. The
reason we proposed to use mpirun is that "hostfile" has
no meaning outside of mpirun. That's why ompi_info
can't do anything in this regard.
We have no idea what hostfile the user may specify
until we actually get the mpirun cmd line. They may
have specified a default-hostfile, but they could also
specify hostfiles for the individual app_contexts.
These may or may not include the node upon which mpirun
is executing.
So the only way to provide you with a separate command
to get a hostfile<->nodename mapping would require you
to provide us with the default-hostfile and/or hostfile
cmd line options just as if you were issuing the mpirun
cmd. We just wouldn't launch - but it would be the
exact equivalent of doing "mpirun --do-not-launch".
Am I missing something? If so, please do correct me - I
would be happy to provide a tool if that would make it
easier. Just not sure what that tool would do.
Thanks
Ralph
On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
Ralph,
It seems a little strange to be using mpirun for this,
but barring providing a separate command, or using
ompi_info, I think this would solve our problem.
Thanks,
Greg
On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
Sorry for the delay - had to ponder this one for a while.
Jeff and I agree that adding something to ompi_info
would not be a good idea. Ompi_info has no knowledge
or understanding of hostfiles, and adding that
capability to it would be a major distortion of its
intended use.
However, we think we can offer an alternative that
might better solve the problem. Remember, we now
treat hostfiles in a very different manner than
before - see the wiki page for a complete
description, or "man orte_hosts".
So the problem is that, to provide you with what you
want, we need to "dump" the information from whatever
default-hostfile was provided, and, if no default-
hostfile was provided, then the information from each
hostfile that was provided with an app_context.
The best way we could think of to do this is to add
another mpirun cmd line option --dump-hostfiles that
would output the line-by-line name from the hostfile
plus the name we resolved it to. Of course, --xml
would cause it to be in xml format.
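For instance, it might look something like this (entirely
hypothetical - the option doesn't exist yet, and the element names
and output layout are just a guess):

$ mpirun --dump-hostfiles --xml --hostfile myhosts
<hostfiles>
  <hostfile name="myhosts">
    <host name="node0" resolved="head.cluster.example.com"/>
  </hostfile>
</hostfiles>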
Would that meet your needs?
Ralph
On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
Hi Ralph,
We've been discussing this back and forth a bit
internally and don't really see an easy solution.
Our problem is that Eclipse is not running on the
head node, so gethostbyname will not necessarily
resolve to the same address. For example, the
hostfile might refer to the head node by an internal
network address that is not visible to the outside
world. Since gethostbyname also looks in /etc/hosts,
it may resolve locally but not on a remote system.
The only thing I can think of would be, rather than
us reading the hostfile directly as we do now, to
provide an option to ompi_info that would dump the
hostfile using the same rules that you apply when
you're using the hostfile. Would that be feasible?
Greg
On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
Sorry for delay - was on vacation and am now trying
to work my way back to the surface.
I'm not sure I can fix this one for two reasons:
1. In general, OMPI doesn't really care what name
is used for the node. However, the problem is that
it needs to be consistent. In this case, ORTE has
already used the name returned by gethostname to
create its session directory structure long before
mpirun reads a hostfile. This is why we retain the
value from gethostname instead of allowing it to be
overwritten by the name in whatever allocation we
are given. Using the name in hostfile would require
that I either find some way to remember any prior
name, or that I tear down and rebuild the session
directory tree - neither seems attractive nor
simple (e.g., what happens when the user provides
multiple entries in the hostfile for the node, each
with a different IP address based on another
interface in that node? Sounds crazy, but we have
already seen it done - which one do I use?).
2. We don't actually store the hostfile info
anywhere - we just use it and forget it. For us to
add an XML attribute containing any hostfile-
related info would therefore require us to re-read
the hostfile. I could have it do that -only- in the
case of "XML output required", but it seems rather
ugly.
An alternative might be for you to simply do a
"gethostbyname" lookup of the IP address or
hostname to see if it matches instead of just doing
a strcmp. This is what we have to do internally as
we frequently have problems with FQDN vs. non-FQDN
vs. IP addresses etc. If the local OS hasn't cached
the IP address for the node in question it can take
a little time to DNS resolve it, but otherwise
works fine.
I can point you to the code in OPAL that we use - I
would think something similar would be easy to
implement in your code and would readily solve the
problem.
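Something along these lines (a minimal sketch using the standard
getaddrinfo() rather than the OPAL helpers; same_host is a made-up
name, and it only checks IPv4 for brevity):

#include <stdbool.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* two names "match" if any of their resolved addresses are equal */
static bool same_host(const char *a, const char *b)
{
    struct addrinfo hints, *ra = NULL, *rb = NULL;
    bool match = false;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4 only, to keep it short */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(a, NULL, &hints, &ra) != 0)
        return false;
    if (getaddrinfo(b, NULL, &hints, &rb) != 0) {
        freeaddrinfo(ra);
        return false;
    }
    for (struct addrinfo *pa = ra; pa && !match; pa = pa->ai_next)
        for (struct addrinfo *pb = rb; pb && !match; pb = pb->ai_next)
            if (((struct sockaddr_in *)pa->ai_addr)->sin_addr.s_addr ==
                ((struct sockaddr_in *)pb->ai_addr)->sin_addr.s_addr)
                match = true;
    freeaddrinfo(ra);
    freeaddrinfo(rb);
    return match;
}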
Ralph
On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
Ralph,
The problem we're seeing is just with the head
node. If I specify a particular IP address for the
head node in the hostfile, it gets changed to the
FQDN when displayed in the map. This is a problem
for us as we need to be able to match the two, and
since we're not necessarily running on the head
node, we can't always do the same resolution
you're doing.
Would it be possible to use the same address that
is specified in the hostfile, or alternatively
provide an XML attribute that contains this
information?
Thanks,
Greg
On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
Not in that regard, depending upon what you mean
by "recently". The only changes I am aware of wrt
nodes consisted of some changes to the order in
which we use the nodes when specified by hostfile
or -host, and a little #if protectionism needed
by Brian for the Cray port.
Are you seeing this for every node? Reason I ask:
I can't offhand think of anything in the code
base that would replace a host name with the FQDN
because we don't get that info for remote nodes.
The only exception is the head node (where mpirun
sits) - in that lone case, we default to the name
returned to us by gethostname(). We do that
because the head node is frequently accessible on
a more global basis than the compute nodes -
thus, the FQDN is required to ensure that there
is no address confusion on the network.
If the user refers to compute nodes in a hostfile
or -host (or in an allocation from a resource
manager) by non-FQDN, we just assume they know
what they are doing and the name will correctly
resolve to a unique address.
On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
Hi,
Has the behavior of the -display-map option changed
recently in the 1.3 branch? We're now seeing the host
name as a fully resolved domain name rather than the
entry that was specified in the hostfile. Is there any
particular reason for this? If so, would it be
possible to add the hostfile entry to the output
since we need to be able to match the two?
Thanks,
Greg
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel