Ralph and I talked some more about this.

Here's what we think:

1. The root cause of the issue is that you are assigning a non-existent IP 
address to a name.  I.e., <foo> maps to 127.0.1.1, but that IP address does not 
exist anywhere.  Hence, OMPI will never conclude that <foo> is "local".  
If you had assigned <foo> to the 127.0.0.1 address, things should have worked 
fine.

Just curious: why are you doing this?
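
To make the "is it local" logic concrete, here's a minimal standalone sketch 
(not Open MPI's actual code) of the underlying test: is this address assigned 
to any local interface?

    /* islocal.c: minimal sketch (NOT Open MPI's code) that checks whether
     * a given IPv4 address is assigned to any local interface.
     * Build:  gcc -o islocal islocal.c
     * Try:    ./islocal 127.0.0.1    (typically configured on lo)
     *         ./islocal 127.0.1.1    (typically not assigned anywhere)   */
    #include <stdio.h>
    #include <arpa/inet.h>
    #include <ifaddrs.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int addr_is_local(struct in_addr target)
    {
        struct ifaddrs *ifap, *ifa;
        int found = 0;

        if (getifaddrs(&ifap) != 0) {
            return 0;
        }
        for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
            /* Only compare against IPv4 addresses actually configured
             * on an interface */
            if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET) {
                continue;
            }
            struct sockaddr_in *sin = (struct sockaddr_in *) ifa->ifa_addr;
            if (sin->sin_addr.s_addr == target.s_addr) {
                found = 1;
                break;
            }
        }
        freeifaddrs(ifap);
        return found;
    }

    int main(int argc, char **argv)
    {
        struct in_addr a;
        if (argc < 2 || inet_pton(AF_INET, argv[1], &a) != 1) {
            fprintf(stderr, "usage: %s <IPv4 address>\n", argv[0]);
            return 1;
        }
        printf("%s is %slocal\n", argv[1], addr_is_local(a) ? "" : "NOT ");
        return 0;
    }

On a typical Linux box this should report that 127.0.1.1 is NOT local, because 
lo is configured with 127.0.0.1 only -- which is exactly why the name 
comparison fails in your setup.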

2. That being said, OMPI is not currently looking at all the addresses 
returned by gethostbyname() -- we only look at the first one.  In the spirit 
of how clients are supposed to behave when a single name lookup returns 
multiple IP addresses, OMPI should examine all of them, find one that it 
"likes", and use that.  Extending OMPI to do this should also fix your issue.
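
As a rough sketch of what "examine all the addresses" could look like (just an 
illustration, assuming an addr_is_local() helper like the one sketched above 
-- this is not the actual patch Ralph will write):

    /* Sketch: accept a name as "local" if *any* of the addresses returned
     * by gethostbyname() -- not just h_addr_list[0] -- is assigned to a
     * local interface. */
    #include <netdb.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    extern int addr_is_local(struct in_addr a);   /* hypothetical helper,
                                                     see previous sketch */

    int name_is_local(const char *name)
    {
        struct hostent *hp = gethostbyname(name);
        int i;

        if (hp == NULL || hp->h_addrtype != AF_INET) {
            return 0;
        }
        /* h_addr_list is a NULL-terminated array; walk all entries */
        for (i = 0; hp->h_addr_list[i] != NULL; i++) {
            struct in_addr a;
            memcpy(&a, hp->h_addr_list[i], sizeof(a));
            if (addr_is_local(a)) {
                return 1;   /* any match is good enough */
            }
        }
        return 0;
    }

The idea is that if the lookup also returns your 10.1.255.x address, it will 
match a local interface and the name will be treated as local, even though the 
first entry (127.0.1.1) does not.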

Ralph is going to work on this, but it'll likely take him a little time to get 
it done.  We'll get it into the trunk and probably ask you to verify that it 
works for you.  And if so, we'll back-port to the v1.6 and v1.7 series.  

One final caveat, however: at this point, it does not look likely that 1.6.6 
will ever happen.  If this all works out, the fix will be committed to the v1.6 
tree, and you can grab a nightly tarball snapshot (which is identical to our 
release tarballs except for its version number), or you can patch your 1.6.5 
installation.  But if 1.6.6 is ever released, the fix will be included.


On Jul 2, 2013, at 9:53 AM, Riccardo Murri <riccardo.mu...@uzh.ch> wrote:

> Hi,
> 
> sorry for the delay in replying -- pretty busy week :-(
> 
> 
> On 28 June 2013 21:54, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> Here's what we think we know (I'm using the name "foo" instead of
>> your actual hostname because it's easier to type):
>> 
>> 1. When you run "hostname", you get foo.local back
> 
> Yes.
> 
> 
>> 2. In your /etc/hosts file, foo.local is listed on two lines:
>>   127.0.1.1
>>   10.1.255.201
>> 
> 
> Yes:
> 
>    [rmurri@nh64-5-9 ~]$ fgrep nh64-5-9 /etc/hosts
>    127.0.1.1   nh64-5-9.local nh64-5-9
>    10.1.255.194    nh64-5-9.local nh64-5-9
> 
> 
>> 3. When you login to the "foo" server and execute mpirun with a hostfile
>> that contains "foo", Open MPI incorrectly thinks that the local machine is
>> not foo, and therefore tries to ssh to it (and things go downhill from
>> there).
>> 
> 
> Yes.
> 
> 
>> 4. When you login to the "foo" server and execute mpirun with a hostfile
>> that contains "foo.local" (you said "FQDN", but never said exactly what you
>> meant by that -- I'm assuming "foo.local", not "foo.yourdomain.com"), then
>> Open MPI behaves properly.
>> 
> 
> Yes.
> 
> FQDN = foo.local.  (This is a compute node in a cluster that does not
> have any public IP address nor a DNS entry -- it only has an interface
> to the cluster-private network.  I presume this is not relevant to
> OpenMPI as long as all names are correctly resolved via `/etc/hosts`.)
> 
> 
>> Is that all correct?
> 
> Yes, all correct.
> 
> 
>> We have some followup questions for you:
>> 
>> 1. What happens when you try to resolve "foo"? (e.g., via the "dig" program
>> -- "dig foo")
> 
> Here's what happens with `dig`:
> 
>    [rmurri@nh64-5-9 ~]$ dig nh64-5-9
> 
>    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9
>    ;; global options:  printcmd
>    ;; Got answer:
>    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4373
>    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
> 
>    ;; QUESTION SECTION:
>    ;nh64-5-9.                 IN      A
> 
>    ;; AUTHORITY SECTION:
>    .                  3600    IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2013070200 1800 900 604800 86400
> 
>    ;; Query time: 17 msec
>    ;; SERVER: 10.1.1.1#53(10.1.1.1)
>    ;; WHEN: Tue Jul  2 15:47:57 2013
>    ;; MSG SIZE  rcvd: 101
> 
> However, `getent hosts` has a different reply:
> 
>    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9
>    127.0.1.1       nh64-5-9.local nh64-5-9
> 
> 
>> 2. What happens when you try to resolve "foo.local"? (e.g., "dig foo.local")
> 
> Here's what happens with `dig`:
> 
>    [rmurri@nh64-5-9 ~]$ dig nh64-5-9.local
> 
>    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.local
>    ;; global options:  printcmd
>    ;; Got answer:
>    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62092
>    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1
> 
>    ;; QUESTION SECTION:
>    ;nh64-5-9.local.                   IN      A
> 
>    ;; ANSWER SECTION:
>    nh64-5-9.local.            259200  IN      A       10.1.255.194
> 
>    ;; AUTHORITY SECTION:
>    local.                     259200  IN      NS      ns.local.
> 
>    ;; ADDITIONAL SECTION:
>    ns.local.          259200  IN      A       127.0.0.1
> 
>    ;; Query time: 0 msec
>    ;; SERVER: 10.1.1.1#53(10.1.1.1)
>    ;; WHEN: Tue Jul  2 15:48:50 2013
>    ;; MSG SIZE  rcvd: 81
> 
> Same query resolved via `getent hosts`:
> 
>    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9
>    127.0.1.1       nh64-5-9.local nh64-5-9
> 
> 
>> 3. What happens when you try to resolve "foo.yourdomain.com"? (e.g., "dig
>> foo.yourdomain.com")
> 
> This yields an empty response from both `dig` and `getent hosts` as the node
> is only attached to a private network and not registered in DNS:
> 
>    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9.uzh.ch
>    [rmurri@nh64-5-9 ~]$ dig nh64-5-9.uzh.ch
> 
>    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.uzh.ch
>    ;; global options:  printcmd
>    ;; Got answer:
>    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 61801
>    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
> 
>    ;; QUESTION SECTION:
>    ;nh64-5-9.uzh.ch.          IN      A
> 
>    ;; AUTHORITY SECTION:
>    uzh.ch.                    8921    IN      SOA     ns1.uzh.ch. hostmaster.uzh.ch. 384627811 3600 1800 3600000 10800
> 
>    ;; Query time: 0 msec
>    ;; SERVER: 10.1.1.1#53(10.1.1.1)
>    ;; WHEN: Tue Jul  2 15:50:54 2013
>    ;; MSG SIZE  rcvd: 84
> 
> 
>> 4. Please apply the attached patch to your Open MPI 1.6.5 build (please note
>> that it adds diagnostic output; do *not* put this patch into production)
>> and:
>>   4a. Run with one of your "bad" cases and send us the output
>>   4b. Run with one of your "good" cases and send us the output
> 
> Please find the outputs attached.  The exact `mpiexec` invocation and
> the machines file are at the beginning of each file.
> 
> Note that I allocated 8 slots (on 4 nodes), but only use 2 slots (on 1 node).
> 
> Thanks,
> Riccardo
> <exam01.out.BAD><exam01.out.GOOD>


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

