Re: [OMPI users] Problem in remote nodes
Jeff,

In my case, it was the firewall. It was restricting communication to ssh only between the compute nodes. I appreciate the help.

Rob

Jeff Squyres (jsquyres) wrote:
> Those are normal ssh messages, I think - an ssh session may try multiple auth methods before one succeeds. You're absolutely sure that there's no firewalling software and SELinux is disabled? OMPI is behaving as if it is trying to communicate and failing (e.g., it's hanging while trying to open some TCP sockets back). Can you open random TCP sockets between your nodes (e.g., in non-MPI processes)?
>
> -jms
> Sent from my PDA. No type good.
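For reference, Open MPI opens TCP connections back from the compute nodes on random ephemeral ports, so ssh getting through is not enough on its own. A quick way to verify raw TCP connectivity and, if the firewall is the culprit, to open up the cluster's private subnet (a sketch -- it assumes nc is installed and that the private network is 192.168.3.0/24, per the 192.168.3.1 address in the sshd logs later in this thread; adjust to your setup):

  # on itanium2: listen on an arbitrary unprivileged port
  # (older netcat builds want "nc -l -p 5000" instead)
  $ nc -l 5000

  # on itanium1: try to reach it; if this hangs, a firewall is in the way
  $ echo hello | nc itanium2 5000

  # inspect the rules currently active on the compute node
  $ /sbin/iptables -L -n

  # as root on each node: accept all traffic from the cluster subnet
  $ /sbin/iptables -I INPUT -s 192.168.3.0/24 -j ACCEPT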
Re: [OMPI users] Problem in remote nodes
On Mar 30, 2010, at 4:28 PM, Robert Collyer wrote:
> I changed the SELinux config to permissive (log only), and it didn't change anything. Back to the drawing board.

I'm afraid I have no experience with SELinux -- I don't know what it restricts. Generally, you need to be able to run processes on remote nodes without entering a password, and also be able to open random TCP and unix sockets between previously unrelated processes.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
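Both requirements are easy to sanity-check by hand (using the host name from this thread; orted is Open MPI's launch daemon):

  # must print the remote hostname without ever prompting for a password
  $ ssh itanium2 hostname

  # the non-interactive shell that mpirun uses must also find Open MPI on its PATH
  $ ssh itanium2 which orted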
Re: [OMPI users] Problem in remote nodes
Those are normal ssh messages, I think - an ssh session may try multiple auth methods before one succeeds. You're absolutely sure that there's no firewalling software and SELinux is disabled? OMPI is behaving as if it is trying to communicate and failing (e.g., it's hanging while trying to open some TCP sockets back). Can you open random TCP sockets between your nodes (e.g., in non-MPI processes)?

-jms
Sent from my PDA. No type good.
Re: [OMPI users] Problem in remote nodes
I've been checking the /var/log/messages on the compute node and there is nothing new after executing 'mpirun --host itanium2 -np 2 helloworld.out', but in the /var/log/messages file on the remote node the following messages appear, nothing about unix_chkpwd:

Mar 31 11:56:51 itanium2 sshd(pam_unix)[15349]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=itanium1 user=otro
Mar 31 11:56:53 itanium2 sshd[15349]: Accepted publickey for otro from 192.168.3.1 port 40999 ssh2
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session opened for user otro by (uid=500)
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session closed for user otro

It seems that the authentication fails at first, but in the next message it connects with the node...

On Tue, Mar 30, 2010, at 20:02, Robert Collyer wrote:
> I've been having similar problems using Fedora Core 9. I believe the issue may be with SELinux, but this is just an educated guess. In my setup, shortly after a login via MPI, there is a notation in /var/log/messages on the compute node which says SELinux denied unix_chkpwd read access to hosts.
>
> Are you getting anything like this?
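The failure/success pair in that log is just sshd working through its list of auth methods until publickey succeeds, as Jeff notes above. You can watch the same sequence from the client side (same user and host as in the log):

  # prints each authentication method attempted before one is accepted
  $ ssh -v otro@itanium2 true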
Re: [OMPI users] Problem in remote nodes
I changed the SELinux config to permissive (log only), and it didn't change anything. Back to the drawing board.

Robert Collyer wrote:
> I've been having similar problems using Fedora Core 9. I believe the issue may be with SELinux, but this is just an educated guess.
>
> In the meantime, I'll check if allowing unix_chkpwd read access to hosts eliminates the problem on my system, and if it works, I'll post the steps involved.
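For the record, switching SELinux to permissive and confirming the mode can be done without a reboot (standard SELinux commands, run as root on the compute node):

  # report the current mode: Enforcing, Permissive, or Disabled
  $ getenforce

  # log-only mode until the next reboot
  $ setenforce 0

  # to make it permanent, set SELINUX=permissive in /etc/selinux/config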
Re: [OMPI users] Problem in remote nodes
I've been having similar problems using Fedora Core 9. I believe the issue may be with SELinux, but this is just an educated guess. In my setup, shortly after a login via MPI, there is a notation in /var/log/messages on the compute node as follows:

Mar 30 12:39:45 kernel: type=1400 audit(1269970785.534:588): avc: denied { read } for pid=8047 comm="unix_chkpwd" name="hosts" dev=dm-0 ino=24579 scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file

which says SELinux denied unix_chkpwd read access to hosts.

Are you getting anything like this?

In the meantime, I'll check if allowing unix_chkpwd read access to hosts eliminates the problem on my system, and if it works, I'll post the steps involved.

uriz.49...@e.unavarra.es wrote:
> I've been investigating and there is no firewall that could stop TCP traffic in the cluster.
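The usual way to turn an avc denial like the one above into a local policy module is audit2allow (a sketch -- log locations and package names vary by distro; on Fedora the tool ships with policycoreutils, and the module name below is arbitrary):

  # build a policy module from the unix_chkpwd denials in the audit log
  $ grep unix_chkpwd /var/log/audit/audit.log | audit2allow -M chkpwd_hosts

  # load it; the denial should then stop appearing
  $ semodule -i chkpwd_hosts.pp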
Re: [OMPI users] Problem in remote nodes
Looks to me like you have an error in your cmd line - you aren't specifying the number of procs to run. My guess is that the system is hanging trying to resolve the process map as a result. Try adding "-np 1" to the cmd line.

The output indicates it is dropping slurm because it doesn't see a slurm allocation, so it is defaulting to use of rsh/ssh to launch.

On Mar 30, 2010, at 4:27 AM, uriz.49...@e.unavarra.es wrote:
> I've been investigating and there is no firewall that could stop TCP traffic in the cluster. With the option --mca plm_base_verbose 30 I get the following output:
> ...
> --Hangs here
>
> It seems a slurm problem??
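Concretely, with the suggested flag the command from the quoted output becomes (same host and executable as in the original report):

  $ mpirun -np 1 --mca plm_base_verbose 30 --host itanium2 helloworld.out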
Re: [OMPI users] Problem in remote nodes
I've been investigating and there is no firewall that could stop TCP traffic in the cluster. With the option --mca plm_base_verbose 30 I get the following output:

[itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2 helloworld.out
[itanium1:08311] mca: base: components_open: Looking for plm components
[itanium1:08311] mca: base: components_open: opening plm components
[itanium1:08311] mca: base: components_open: found loaded component rsh
[itanium1:08311] mca: base: components_open: component rsh has no register function
[itanium1:08311] mca: base: components_open: component rsh open function successful
[itanium1:08311] mca: base: components_open: found loaded component slurm
[itanium1:08311] mca: base: components_open: component slurm has no register function
[itanium1:08311] mca: base: components_open: component slurm open function successful
[itanium1:08311] mca:base:select: Auto-selecting plm components
[itanium1:08311] mca:base:select:( plm) Querying component [rsh]
[itanium1:08311] mca:base:select:( plm) Query of component [rsh] set priority to 10
[itanium1:08311] mca:base:select:( plm) Querying component [slurm]
[itanium1:08311] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[itanium1:08311] mca:base:select:( plm) Selected component [rsh]
[itanium1:08311] mca: base: close: component slurm closed
[itanium1:08311] mca: base: close: unloading component slurm

--Hangs here

It seems a slurm problem??

Thanks for any ideas

On Fri, Mar 19, 2010, at 17:57, Ralph Castain wrote:
> Did you configure OMPI with --enable-debug? You should do this so that more diagnostic output is available.
>
> You can also add the following to your cmd line to get more info:
>
> --debug --debug-daemons --leave-session-attached
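For what it's worth, the slurm lines in the output above are harmless: the slurm launcher is queried, returns no module because there is no SLURM allocation, and is unloaded, after which the rsh launcher is (correctly) selected for ssh-based launching. You can take slurm out of the picture explicitly to confirm it is not involved (standard MCA syntax, same command as above):

  $ mpirun --mca plm rsh --mca plm_base_verbose 30 --host itanium2 -np 1 helloworld.out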
Re: [OMPI users] Problem in remote nodes
Did you configure OMPI with --enable-debug? You should do this so that more diagnostic output is available.

You can also add the following to your cmd line to get more info:

--debug --debug-daemons --leave-session-attached

Something is likely blocking proper launch of the daemons and processes, so you aren't getting to the btl's at all.

On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:
> The processes are running on the remote nodes but they don't give the response to the origin node. I don't know why.
> With the option --mca btl_base_verbose 30, I have the same problems and it doesn't show any message.
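A sketch of the rebuild-and-rerun cycle this suggests (the installation prefix is an assumption -- use whatever you configured originally):

  # rebuild Open MPI with debugging support
  $ ./configure --prefix=/opt/openmpi --enable-debug
  $ make all install

  # rerun with the daemons' debug output kept attached to the terminal
  $ mpirun --debug-daemons --leave-session-attached --host itanium2 -np 1 helloworld.out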
Re: [OMPI users] Problem in remote nodes
The processes are running on the remote nodes but they don't give the response to the origin node. I don't know why. With the option --mca btl_base_verbose 30, I have the same problems and it doesn't show any message.

Thanks

> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres wrote:
>> You might want to check that Open MPI processes are actually running on the remote nodes -- check with ps if you see any "orted" or other MPI-related processes (e.g., your processes).
>>
>> Do you have any TCP firewall software running between the nodes? If so, you'll need to disable it (at least for Open MPI jobs).
>
> I also recommend running mpirun with the option --mca btl_base_verbose 30 to troubleshoot tcp issues.
Re: [OMPI users] Problem in remote nodes
On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres wrote:
> You might want to check that Open MPI processes are actually running on the remote nodes -- check with ps if you see any "orted" or other MPI-related processes (e.g., your processes).
>
> Do you have any TCP firewall software running between the nodes? If so, you'll need to disable it (at least for Open MPI jobs).

I also recommend running mpirun with the option --mca btl_base_verbose 30 to troubleshoot tcp issues.

In some environments, you need to explicitly tell mpirun what network interfaces it can use to reach the hosts. Read the following FAQ section for more information:

http://www.open-mpi.org/faq/?category=tcp

Item 7 of the FAQ might be of special interest.

Regards,
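The interface-selection knobs that FAQ describes are the btl_tcp_if_include / btl_tcp_if_exclude MCA parameters (a sketch -- the interface name is an assumption; substitute whatever ifconfig shows on your nodes):

  # restrict MPI traffic to the private cluster interface
  $ mpirun --mca btl_tcp_if_include eth0 --host itanium2 -np 1 helloworld.out

  # the runtime's control channel has an analogous parameter
  $ mpirun --mca oob_tcp_if_include eth0 --host itanium2 -np 1 helloworld.out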
Re: [OMPI users] Problem in remote nodes
On Mar 17, 2010, at 4:39 AM, uriz.49...@e.unavarra.es wrote:
> Hi everyone. I'm a new Open MPI user and I have just installed Open MPI on a 6-node cluster running Scientific Linux. When I execute it locally it works perfectly, but when I try to execute it on the remote nodes with the --host option it hangs and gives no message.

You might want to check that Open MPI processes are actually running on the remote nodes -- check with ps if you see any "orted" or other MPI-related processes (e.g., your processes).

Do you have any TCP firewall software running between the nodes? If so, you'll need to disable it (at least for Open MPI jobs).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
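Checking for the daemon on a compute node is a one-liner (host and program names taken from this thread):

  # list the Open MPI launch daemon and the user's processes on the remote node
  $ ssh itanium2 ps -ef | egrep 'orted|helloworld'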
[OMPI users] Problem in remote nodes
Hi everyone,

I'm a new Open MPI user and I have just installed Open MPI on a 6-node cluster running Scientific Linux. When I execute it locally it works perfectly, but when I try to execute it on the remote nodes with the --host option it hangs and gives no message. I think the problem could be with the shared libraries, but I'm not sure. In my opinion the problem is not ssh, because I can access the nodes with no password.

If someone could give me an idea of what my problem could be, I'd be very pleased... I'm totally blocked.

Thanks