Many thanks. The firewall is the issue.
On Feb 9, 2009, at 5:56 AM, Ralph Castain wrote:
It sounds to me like TCP communication isn't getting through for
some reason. Try the following:
mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
black@ccn3:~/Documents/mp> mpirun --mca plm_base_verbose 5 --hostfile
myh3 -pernode hostname
[ccn3:26932] mca:base:select:( plm) Querying component [rsh]
[ccn3:26932] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[ccn3:26932] mca:base:select:( plm) Querying component [slurm]
[ccn3:26932] mca:base:select:( plm) Skipping component [slurm]. Query
failed to return a module
[ccn3:26932] mca:base:select:( plm) Selected component [rsh]
-----hangs here
But, when I turn off the firewall for a moment on both machines, local
and remote, everything works:
black@ccn3:~/Documents/mp> mpirun --mca plm_base_verbose 5 --hostfile
myh3 -pernode hostname
[ccn3:26442] mca:base:select:( plm) Querying component [rsh]
[ccn3:26442] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[ccn3:26442] mca:base:select:( plm) Querying component [slurm]
[ccn3:26442] mca:base:select:( plm) Skipping component [slurm]. Query
failed to return a module
[ccn3:26442] mca:base:select:( plm) Selected component [rsh]
ccn3
ccn4
Two questions:
1) Is it really trying to use 'rsh', or is that just the component
name in the debug output? I assume it is actually using ssh under
the hood, but it is worth asking. I am using the default
configuration here.
black@ccn3:~/Documents/mp> ompi_info --param all all | grep pls
MCA plm: parameter "plm_rsh_agent" (current value:
"ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
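For what it's worth, the "rsh" component name covers the whole rsh/ssh launcher; the agent list "ssh : rsh" in the ompi_info output above means ssh is tried first and rsh is only a fallback. A sketch of how to confirm this, or force ssh explicitly (using only the plm_rsh_agent parameter shown above — reusing the myh3 hostfile from this thread):

```shell
# Show just the launcher-agent setting; the component is named "rsh"
# but its default agent list tries ssh before rsh
ompi_info --param plm rsh | grep agent

# Force plain ssh explicitly, ruling rsh out of the picture entirely
mpirun --mca plm_rsh_agent ssh --hostfile myh3 -pernode hostname
```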
2) Since it is a firewall issue, I read what I could find, and it
seems there is no means of restricting port ranges. Right now,
each node in this small cluster runs its own firewall rather
than all being hidden behind some other machine or switch. Any
pointers on handling this most easily?
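[Regarding question 2: some Open MPI releases do expose MCA parameters that pin the TCP port ranges, which lets the firewall stay up with only a narrow range opened between the nodes. A sketch under those assumptions — the parameter names and the 192.168.0.0/24 subnet are examples to verify for your version and network, not confirmed for this setup:]

```shell
# Verify which port-range parameters your release actually supports
# (e.g. btl_tcp_port_min_v4 / btl_tcp_port_range_v4 for MPI traffic;
# some releases have analogous oob_tcp_* parameters for the
# out-of-band daemon-callback channel)
ompi_info --param all all | grep port

# Pin the TCP BTL to ports 10000-10099, then launch as before
mpirun --mca btl_tcp_port_min_v4 10000 \
       --mca btl_tcp_port_range_v4 100 \
       --hostfile myh3 -pernode hostname

# Example iptables rules on each node, allowing that range plus ssh
# from the cluster subnet (adjust the subnet to match your nodes)
iptables -A INPUT -p tcp -s 192.168.0.0/24 --dport 10000:10099 -j ACCEPT
iptables -A INPUT -p tcp -s 192.168.0.0/24 --dport 22 -j ACCEPT
```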
Cheers, Kersey
You should see output from the receipt of a daemon callback for each
daemon, then the sending of the launch command. My guess is that you
won't see all of the daemons call back, which is why you hang.
This should tell you which node isn't getting a message back to
wherever mpirun is executing. You might then check that no
firewalls are in the way on that node, that there is a TCP path
back from it, etc.
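[One quick way to test the TCP return path independently of Open MPI is with a throwaway listener. A sketch using netcat and an arbitrary port — nc is an assumption, any TCP listener would do, and the listen-mode flag syntax varies between netcat implementations:]

```shell
# On the node running mpirun (ccn3 in the output above):
# listen on an arbitrary unprivileged port (some netcats need -l -p 12345)
nc -l 12345

# On the remote node (ccn4): try to connect back through the firewall;
# give up after 3 seconds if the port is blocked
nc -w 3 ccn3 12345 && echo "TCP path back is open"
```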
I can help with additional diagnostics once we get that far.
Ralph
On Feb 7, 2009, at 2:40 PM, Kersey Black wrote: