I ran with -mca oob_tcp_verbose 99 and I am getting something interesting I never 
saw before.

[machine 2:22347] bind() failed: no port available in the range [60001-60016]
[machine 2:22347] mca_oob_tcp_init: unable to create IPv4 listen socket: Error

I never got that error before we messed with the iptables, but now I do... Very 
interesting. I will have to talk to my sysadmin again and make sure he opened the 
right ports on my two test machines. It looks as though there are no open ports. 
Another interesting thing is that the Daemon is still reporting:

Daemon [[28845,0],1] checking in as pid 22347 on host machine 2
Daemon [[28845,0],1] not using static ports

Which, unless I am misunderstanding, should have been taken care of when I 
specified which ports to use. I am telling it a static set of ports... Anyhow, I 
will get with my sysadmin again and see what he says. At least OpenMPI is 
correctly interpreting the range.

Thanks for the help.

--- On Sat, 7/10/10, Ralph Castain <r...@open-mpi.org> wrote:

From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Saturday, July 10, 2010, 3:21 PM

Are there multiple interfaces on your nodes? I'm wondering if we are using a 
different network than the one where you opened these ports.
You'll get quite a bit of output, but you can turn on debug output in the oob 
itself with -mca oob_tcp_verbose xx. The higher the number, the more you get.

On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:
Hello again,

I believe my administrator has opened the ports I requested. The problem I am 
having now is that OpenMPI is not honoring the port assignments I defined in 
openmpi-mca-params.conf (those files have permission 644; should they be 755?).

When I perform netstat -ltnup I see orted listening on 14 TCP sockets, but they 
are scattered around the 26000ish port range even though I specified 60001-60016 
in the mca-params file. Is there a parameter I am missing? In any case, I am still 
hanging as mentioned originally, even with the ports opened and the mca-params 
specifications in place.

Any other ideas on what might be causing the hang? Is there a more verbose mode 
I can employ to see more deeply into the issue? I have run --debug-daemons and 
--mca plm_base_verbose 99.

Thanks!
--- On Tue, 7/6/10, Robert Walters <raw19...@yahoo.com> wrote:

From: Robert Walters <raw19...@yahoo.com>
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 5:41 PM

Thanks for your expeditious responses, Ralph.

Just to confirm with you, I should change openmpi-mca-params.conf to include:

oob_tcp_port_min_v4 = (My minimum port in the range)
oob_tcp_port_range_v4 = (My port range)
btl_tcp_port_min_v4 = (My minimum port in the range)
btl_tcp_port_range_v4 = (My port range)

correct?
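Concretely, with my 60001-60016 window I believe that would look like the following (I am assuming the range parameter is a count of ports rather than an upper bound; please correct me if that is wrong):

```ini
# openmpi-mca-params.conf (hypothetical values for a 16-port window)
oob_tcp_port_min_v4 = 60001
oob_tcp_port_range_v4 = 16
btl_tcp_port_min_v4 = 60001
btl_tcp_port_range_v4 = 16
```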

Also, for a cluster of around 32-64 processes (8 processors per node), how wide 
of a range will I require? I've noticed some entries in the mailing list 
suggesting you need a few to get started and then it opens more as necessary. 
Will I be safe with 20 or should I go for 100?

Thanks again for all of your help!

--- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:

From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 5:31 PM

Problem isn't with ssh - the problem is that the daemons need to open a TCP 
connection back to the machine where mpirun is running. If the firewall blocks 
that connection, then we can't run.
If you can get a range of ports opened, then you can specify the ports OMPI 
should use for this purpose. If the sysadmin won't allow even that, then you 
are pretty well hosed.
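As a sketch (the port values are just an example, and I'm showing the command-line form; the same params can go in openmpi-mca-params.conf):

```shell
mpirun --mca oob_tcp_port_min_v4 60001 \
       --mca oob_tcp_port_range_v4 16 \
       -hostfile hostfile -np 16 hello_c
```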

On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
Yes, there is a system firewall. I don't think the sysadmin will allow it to be 
disabled. Each Linux machine has the built-in RHEL firewall. SSH is enabled 
through the firewall, though.

--- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:

From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 4:19 PM

It looks like the remote daemon is starting - is there a firewall in the way?
On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
Hello all,

I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons, and right 
now I am just working on getting OpenMPI itself up and running. I have a 
successful configure and make all install. The LD_LIBRARY_PATH and PATH variables 
were correctly edited. mpirun -np 8 hello_c successfully works on all machines. 
I have set up my two test machines with DSA key pairs that successfully work 
with each other.

The problem comes when I use my hostfile to attempt to communicate across 
machines. The hostfile is set up correctly with <host_name> <slots> <max-slots>.
When running with all verbose options enabled, "mpirun --mca plm_base_verbose 99 
--debug-daemons --mca btl_base_verbose 30 --mca oob_base_verbose 99 --mca 
pml_base_verbose 99 -hostfile hostfile -np 16 hello_c", I receive the following 
text output.

[machine1:03578] mca: base: components_open: Looking for plm components
[machine1:03578] mca: base: components_open: opening plm components
[machine1:03578] mca: base: components_open: found loaded component rsh
[machine1:03578] mca: base: components_open: component rsh has no register 
function
[machine1:03578] mca: base: components_open: component rsh open function 
successful
[machine1:03578] mca: base: components_open: found loaded component slurm
[machine1:03578] mca: base: components_open: component slurm has no register 
function
[machine1:03578] mca: base: components_open: component slurm open function 
successful
[machine1:03578] mca:base:select: Auto-selecting plm components
[machine1:03578] mca:base:select:(  plm) Querying component [rsh]
[machine1:03578] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[machine1:03578] mca:base:select:(  plm) Querying component [slurm]
[machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[machine1:03578] mca:base:select:(  plm) Selected component [rsh]
[machine1:03578] mca: base: close: component slurm closed
[machine1:03578] mca: base: close: unloading component slurm
[machine1:03578] mca: base: components_open: Looking for oob components
[machine1:03578] mca: base: components_open: opening oob components
[machine1:03578] mca: base: components_open: found loaded component tcp
[machine1:03578] mca: base: components_open: component tcp has no register 
function
[machine1:03578] mca: base: components_open: component tcp open function 
successful
Daemon was launched on machine2- beginning to initialize
[machine2:01962] mca: base: components_open: Looking for oob components
[machine2:01962] mca: base: components_open: opening oob components
[machine2:01962] mca: base: components_open: found loaded component tcp
[machine2:01962] mca: base: components_open: component tcp has no register 
function
[machine2:01962] mca: base: components_open: component tcp open function 
successful
Daemon [[1418,0],1] checking in as pid 1962 on host machine2
Daemon [[1418,0],1] not using static ports

At this point the system hangs indefinitely. While running top in the machine2 
terminal, I briefly see several things come up: sshd (root), tcsh (myuser), 
orted (myuser), and mcstransd (root). I was wondering whether sshd needs to be 
initiated by myuser? It is currently turned off in sshd_config via UsePAM yes. 
That was set up by the sysadmin, but it can be worked around if necessary.

So in summary, mpirun works on each machine individually but hangs when 
initiated through a hostfile or with the -host flag. ./configure was run with 
defaults plus --prefix. LD_LIBRARY_PATH and PATH are set up correctly. Any help 
is appreciated. Thanks!




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
