Hello,

we have been observing a strange behavior with OpenMPI 1.6.3:

    strace -f /share/apps/openmpi/1.6.3/bin/mpiexec -n 2
--nooversubscribe --display-allocation --display-map --tag-output
/share/apps/gamess/2011R1/gamess.2011R1.x
/state/partition1/rmurri/29515/exam01.F05 -scr
/state/partition1/rmurri/29515

    ======================   ALLOCATED NODES   ======================

     Data for node: nh64-1-17.local Num slots: 0    Max slots: 0
     Data for node: nh64-1-17       Num slots: 2    Max slots: 0

    =================================================================

     ========================   JOB MAP   ========================

     Data for node: nh64-1-17       Num procs: 2
            Process OMPI jobid: [37108,1] Process rank: 0
            Process OMPI jobid: [37108,1] Process rank: 1

     =============================================================

As you can see, the machine file lists the *unqualified* local host
name; OpenMPI fails to recognize it as the host it is running on, and
uses `ssh` to spawn a remote `orted`, as the `strace -f` output shows:

    Process 16552 attached
    [pid 16552] execve("//usr/bin/ssh", ["/usr/bin/ssh", "-x",
"nh64-1-17", "OPAL_PREFIX=/share/apps/openmpi/1.6.3 ; export
OPAL_PREFIX; PATH=/share/apps/openmpi/1.6.3/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/share/apps/openmpi/1.6.3/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/share/apps/openmpi/1.6.3/lib:$", "--daemonize",
"-mca", "ess", "env", "-mca", "orte_ess_jobid", "2431909888", "-mca",
"orte_ess_vpid", "1", "-mca", "orte_ess_num_procs", "2", "--hnp-uri",
"\"2431909888.0;tcp://10.1.255.237:33154\"", "-mca", "plm", "rsh"],
["OLI235=/state/partition1/rmurri/29515/exam01.F235", ...

If the machine file lists the FQDNs instead, `mpiexec` spawns the jobs
directly via fork()/exec().
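
For comparison, this is roughly the machine file and invocation that
work around the issue for us (the file name `hostfile.fqdn` and the
use of `/bin/hostname` as a stand-in for the GAMESS binary are just
examples, not our exact setup):

    $ cat hostfile.fqdn
    # fully-qualified name of the local compute node, with its slot count
    nh64-1-17.local slots=2

    $ /share/apps/openmpi/1.6.3/bin/mpiexec -n 2 --nooversubscribe \
        --hostfile hostfile.fqdn --display-allocation --display-map \
        /bin/hostname

With the `.local` name in the machine file, the two processes are
spawned locally via fork()/exec() instead of going through `ssh`.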

This seems related to the fact that each compute node advertises
127.0.1.1 as the IP address associated with its hostname:

    $ ssh nh64-1-17 getent hosts nh64-1-17
    127.0.1.1    nh64-1-17.local nh64-1-17

Indeed, if I change /etc/hosts so that a compute node associates a
"real" IP with its hostname, `mpiexec` works as expected.

Is this a known feature/bug/easter egg?

For the record: using OpenMPI 1.6.3 on Rocks 5.2.

Thanks,
on behalf of the GC3 Team
Sergio :)

GC3: Grid Computing Competence Center
http://www.gc3.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
Tel: +41 44 635 4222
Fax: +41 44 635 6888
