Hi,
Disclaimer up front -- a newbie to openmpi working to get Gromacs and other modeling code running. I have it running fine on the local machine, but I am unable to get openmpi to work when trying to include a remote machine.
Any help or pointers would be greatly appreciated.

System:   openSUSE 10.3
Openmpi: I first installed 1.2.2 as an rpm from yast. When that did not seem to work, I switched to the current release, 1.3, compiled with the default configuration options, except that I used --prefix to set the installation directory
openmpi-mca-params.conf:   (with 1.3) I have only added
   btl = self,tcp
   mpi_show_mca_params = enviro
ssh:  host-based authentication

With both installs, I can run on multiple slots on the local machine, but when I try to include a remote machine, it hangs.
Using this hostfile:
  ccn3 slots=2 max_slots=2
  ccn4 slots=2 max_slots=2
Typical output (this is from 1.3) when I try to run two slots locally (ccn3) and 2 on the remote machine (ccn4):
-----
black@ccn3:~/Documents/mp> mpirun --debug-daemons --hostfile myh3 -np 4 hostname
Daemon was launched on ccn3 - beginning to initialize
Daemon [[63883,0],1] checking in as pid 20554 on host ccn3
Daemon [[63883,0],1] not using static ports
[ccn3:20554] [[63883,0],1] orted: up and running - waiting for commands!
Daemon was launched on ccn4 - beginning to initialize
Daemon [[63883,0],2] checking in as pid 7485 on host ccn4
Daemon [[63883,0],2] not using static ports
----
And here it hangs

When I kill the job with ^C, I get:
        ccn3
        ccn4 - daemon did not report back when launched

Everything I read in the FAQ (in particular in part 2 of the "Running MPI" portion) suggests that this has to do with SSH problems or with PATH problems. SSH is configured for host-based authentication and seems to be working fine. I set LD_LIBRARY_PATH to include openmpi/lib and added the openmpi/bin directory to PATH in a script that runs for all users (called by /bin/bashrc.local), and when things did not work, I put the same code in ~/.bashrc and ~/.profile. As a result, `env` in an interactive shell shows each path added three times, but this has not solved the problem.
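
Since `env` in an interactive shell may not reflect what a remote, non-interactive ssh command sees, a check along the following lines could help narrow things down. This is only a sketch: the function name check_remote_env and the SSH variable are made up for illustration and are not part of Open MPI; orted is the daemon name that appears in the debug output above.

```shell
#!/bin/sh
# Sketch: print the environment that a NON-interactive shell on a remote
# host sees. check_remote_env and SSH are illustrative names only.
SSH="${SSH:-ssh}"

check_remote_env() {
    host="$1"
    # Single quotes stop the variables from expanding locally; they are
    # expanded by the remote, non-interactive shell instead.
    $SSH "$host" 'echo "PATH=$PATH"; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"; command -v orted || echo "orted not found in PATH"'
}

# Example (not run automatically): check_remote_env ccn4
```

If the openmpi/bin and openmpi/lib directories are missing from this output but present in an interactive login on the same host, the startup files are probably not being read for non-interactive sessions.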

For comparison, when I run it on just two slots on the local machine, I get:
black@ccn3:~/Documents/mp> mpirun --debug-daemons --hostfile myh3 -np 2 hostname
Daemon was launched on ccn3 - beginning to initialize
Daemon [[63924,0],1] checking in as pid 20608 on host ccn3
Daemon [[63924,0],1] not using static ports
[ccn3:20603] [[63924,0],0] orted_cmd: received add_local_procs
[ccn3:20603] [[63924,0],0] node[0].name ccn3 daemon 0 arch ffc91200
[ccn3:20603] [[63924,0],0] node[1].name ccn3 daemon 1 arch ffc91200
[ccn3:20603] [[63924,0],0] node[2].name ccn4 daemon INVALID arch ffc91200
[ccn3:20608] [[63924,0],1] orted: up and running - waiting for commands!
[ccn3:20608] [[63924,0],1] orted_cmd: received add_local_procs
[ccn3:20608] [[63924,0],1] node[0].name ccn3 daemon 0 arch ffc91200
[ccn3:20608] [[63924,0],1] node[1].name ccn3 daemon 1 arch ffc91200
[ccn3:20608] [[63924,0],1] node[2].name ccn4 daemon INVALID arch ffc91200
ccn3
[ccn3:20608] [[63924,0],1] orted_cmd: received waitpid_fired cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received iof_complete cmd
ccn3
[ccn3:20608] [[63924,0],1] orted_cmd: received waitpid_fired cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received iof_complete cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received exit
[ccn3:20608] [[63924,0],1] orted: finalizing

I can also run it locally on the remote machine with the command:
ssh ccn4 mpirun --debug-daemons -np 2 hostname

Many thanks for any ideas.

Kersey
