Many thanks. The firewall is the issue.
On Feb 9, 2009, at 5:56 AM, Ralph Castain wrote:
It sounds to me like TCP communication isn't getting through for
some reason. Try the following:
mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
black@ccn3:~/Documents/mp> mpirun --mca plm_base_verbose 5 --hostfile
myh3 -pernode hostname
[ccn3:26932] mca:base:select:( plm) Querying component [rsh]
[ccn3:26932] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[ccn3:26932] mca:base:select:( plm) Querying component [slurm]
[ccn3:26932] mca:base:select:( plm) Skipping component [slurm]. Query
failed to return a module
[ccn3:26932] mca:base:select:( plm) Selected component [rsh]
-----hangs here
But, when I turn off the firewall for a moment on both machines, local
and remote, everything works:
black@ccn3:~/Documents/mp> mpirun --mca plm_base_verbose 5 --hostfile
myh3 -pernode hostname
[ccn3:26442] mca:base:select:( plm) Querying component [rsh]
[ccn3:26442] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[ccn3:26442] mca:base:select:( plm) Querying component [slurm]
[ccn3:26442] mca:base:select:( plm) Skipping component [slurm]. Query
failed to return a module
[ccn3:26442] mca:base:select:( plm) Selected component [rsh]
ccn3
ccn4
Two questions:
1) Is it really trying to use 'rsh', or is that just the component
name in the debug output? I assume it is actually using ssh under
the hood, but it is worth asking. I am using the default
configuration here.
black@ccn3:~/Documents/mp> ompi_info --param all all | grep pls
MCA plm: parameter "plm_rsh_agent" (current value:
"ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
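For what it's worth, the "rsh" component name covers the whole rsh/ssh launcher; the agent list "ssh : rsh" in the ompi_info output above means ssh is tried first and rsh is only a fallback. A sketch of how to confirm this, or force ssh explicitly (using only the plm_rsh_agent parameter shown above — reusing the myh3 hostfile from this thread):

```shell
# Show just the launcher-agent setting; the component is named "rsh"
# but its default agent list tries ssh before rsh
ompi_info --param plm rsh | grep agent

# Force plain ssh explicitly, ruling rsh out of the picture entirely
mpirun --mca plm_rsh_agent ssh --hostfile myh3 -pernode hostname
```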
2) Since it is a firewall issue, I read what I could find, and it
seems there is no means of restricting port ranges. Right now,
each node in this small cluster runs its own firewall rather
than all being hidden behind some other machine or switch. Any
pointers on handling this most easily?
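[Regarding question 2: some Open MPI releases do expose MCA parameters that pin the TCP port ranges, which lets the firewall stay up with only a narrow range opened between the nodes. A sketch under those assumptions — the parameter names and the 192.168.0.0/24 subnet are examples to verify for your version and network, not confirmed for this setup:]

```shell
# Verify which port-range parameters your release actually supports
# (e.g. btl_tcp_port_min_v4 / btl_tcp_port_range_v4 for MPI traffic;
# some releases have analogous oob_tcp_* parameters for the
# out-of-band daemon-callback channel)
ompi_info --param all all | grep port

# Pin the TCP BTL to ports 10000-10099, then launch as before
mpirun --mca btl_tcp_port_min_v4 10000 \
       --mca btl_tcp_port_range_v4 100 \
       --hostfile myh3 -pernode hostname

# Example iptables rules on each node, allowing that range plus ssh
# from the cluster subnet (adjust the subnet to match your nodes)
iptables -A INPUT -p tcp -s 192.168.0.0/24 --dport 10000:10099 -j ACCEPT
iptables -A INPUT -p tcp -s 192.168.0.0/24 --dport 22 -j ACCEPT
```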
Cheers, Kersey
You should see output from the receipt of a daemon callback for each
daemon, then the sending of the launch command. My guess is that you
won't see all of the daemons call back, which is why you hang.
This should tell you which node isn't getting a message back to
wherever mpirun is executing. You might then check that no
firewalls are in the way on that node, that there is a TCP path
back from it, etc.
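[One quick way to test the TCP return path independently of Open MPI is with a throwaway listener. A sketch using netcat and an arbitrary port — nc is an assumption, any TCP listener would do, and the listen-mode flag syntax varies between netcat implementations:]

```shell
# On the node running mpirun (ccn3 in the output above):
# listen on an arbitrary unprivileged port (some netcats need -l -p 12345)
nc -l 12345

# On the remote node (ccn4): try to connect back through the firewall;
# give up after 3 seconds if the port is blocked
nc -w 3 ccn3 12345 && echo "TCP path back is open"
```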
I can help with additional diagnostics once we get that far.
Ralph
On Feb 7, 2009, at 2:40 PM, Kersey Black wrote: