Re: [OMPI users] Problem in remote nodes

2010-04-07 Thread Robert Collyer

Jeff,
In my case, it was the firewall.  It was restricting communication to 
ssh only between the compute nodes.  I appreciate the help.
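
For anyone hitting the same symptom, a minimal sketch of the kind of iptables change involved, assuming the compute nodes talk over the 192.168.3.0/24 subnet that shows up in the sshd logs below (the rule and subnet are illustrative, not the exact ones used on this cluster):

  # Allow all TCP between cluster nodes; Open MPI uses random ephemeral
  # ports, so allowing only ssh (port 22) is not enough.
  iptables -I INPUT -p tcp -s 192.168.3.0/24 -j ACCEPT
  # Make the rule persistent (RHEL/Scientific Linux style; distribution-specific)
  service iptables save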


Rob

Jeff Squyres (jsquyres) wrote:


Those are normal ssh messages, I think - an ssh session may try 
multiple auth methods before one succeeds.


You're absolutely sure that there's no firewalling software and 
SELinux is disabled?  OMPI is behaving as if it is trying to 
communicate and failing (e.g., it's hanging while trying to open some 
TCP sockets back).


Can you open random tcp sockets between your nodes?  (E.g., in non-mpi 
processes)


-jms
Sent from my PDA.  No type good.

- Original Message -
From: users-boun...@open-mpi.org 
To: Open MPI Users 
Sent: Wed Mar 31 06:25:43 2010
Subject: Re: [OMPI users] Problem in remote nodes

I've been checking /var/log/messages on the compute node and there is
nothing new after executing 'mpirun --host itanium2 -np 2 helloworld.out',
but the following messages appear in /var/log/messages on the remote node;
there is nothing about unix_chkpwd.

Mar 31 11:56:51 itanium2 sshd(pam_unix)[15349]: authentication failure;
logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=itanium1  user=otro
Mar 31 11:56:53 itanium2 sshd[15349]: Accepted publickey for otro from
192.168.3.1 port 40999 ssh2
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session opened for user
otro by (uid=500)
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session closed for 
user otro


It seems that the authentication fails at first, but in the next message
it connects with the node...

On Tue, Mar 30, 2010, at 20:02, Robert Collyer wrote:
> I've been having similar problems using Fedora core 9.  I believe the
> issue may be with SELinux, but this is just an educated guess.  In my
> setup, shortly after a login via mpi, there is a notation in the
> /var/log/messages on the compute node as follows:
>
> Mar 30 12:39:45  kernel: type=1400 audit(1269970785.534:588):
> avc:  denied  { read } for  pid=8047 comm="unix_chkpwd" name="hosts"
> dev=dm-0 ino=24579
> scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file
>
> which says SELinux denied unix_chkpwd read access to hosts.
>
> Are you getting anything like this?
>
> In the meantime, I'll check if allowing unix_chkpwd read access to hosts
> eliminates the problem on my system, and if it works, I'll post the
> steps involved.
>
> uriz.49...@e.unavarra.es wrote:
>> I've been investigating and there is no firewall that could stop TCP
>> traffic in the cluster. With the option --mca plm_base_verbose 30 I get
>> the following output:
>>
>> [itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2
>> helloworld.out
>> [itanium1:08311] mca: base: components_open: Looking for plm components
>> [itanium1:08311] mca: base: components_open: opening plm components
>> [itanium1:08311] mca: base: components_open: found loaded component rsh
>> [itanium1:08311] mca: base: components_open: component rsh has no register function
>> [itanium1:08311] mca: base: components_open: component rsh open function successful
>> [itanium1:08311] mca: base: components_open: found loaded component slurm
>> [itanium1:08311] mca: base: components_open: component slurm has no register function
>> [itanium1:08311] mca: base: components_open: component slurm open function successful
>> [itanium1:08311] mca:base:select: Auto-selecting plm components
>> [itanium1:08311] mca:base:select:(  plm) Querying component [rsh]
>> [itanium1:08311] mca:base:select:(  plm) Query of component [rsh] set priority to 10
>> [itanium1:08311] mca:base:select:(  plm) Querying component [slurm]
>> [itanium1:08311] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
>> [itanium1:08311] mca:base:select:(  plm) Selected component [rsh]
>> [itanium1:08311] mca: base: close: component slurm closed
>> [itanium1:08311] mca: base: close: unloading component slurm
>>
>> --Hangs here
>>
>> Could it be a slurm problem?
>>
>> Thanks for any ideas.
>>
>> On Fri, Mar 19, 2010, at 17:57, Ralph Castain wrote:
>>
>>> Did you configure OMPI with --enable-debug? You should do this so that
>>> more diagnostic output is available.
>>>
>>> You can also add the following to your cmd line to get more info:
>>>
>>> --debug --debug-daemons --leave-session-attached
>>>
>>> Something is likely blocking proper launch of the daemons and processes so
>>> you aren't getting to the btl's at all.

Re: [OMPI users] Problem in remote nodes

2010-03-31 Thread Jeff Squyres
On Mar 30, 2010, at 4:28 PM, Robert Collyer wrote:

> I changed the SELinux config to permissive (log only), and it didn't
> change anything.  Back to the drawing board.

I'm afraid I have no experience with SELinux -- I don't know what it restricts. 
 Generally, you need to be able to run processes on remote nodes without 
entering a password and also be able to open random TCP and unix sockets 
between previously unrelated processes.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Problem in remote nodes

2010-03-31 Thread Jeff Squyres (jsquyres)
Those are normal ssh messages, I think - an ssh session may try multiple auth 
methods before one succeeds. 

You're absolutely sure that there's no firewalling software and SELinux is 
disabled?  OMPI is behaving as if it is trying to communicate and failing 
(e.g., it's hanging while trying to open some TCP sockets back). 

Can you open random tcp sockets between your nodes?  (E.g., in non-mpi 
processes)
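
For example, a quick non-MPI check with netcat (a sketch; assumes nc is
installed on both nodes and that port 5000 is free -- any unprivileged port
will do):

  # On itanium2: listen on an arbitrary unprivileged port
  nc -l 5000          # traditional netcat wants: nc -l -p 5000

  # On itanium1: connect and send a line
  echo hello | nc itanium2 5000

  # If the line never shows up on itanium2, plain TCP between the nodes is
  # being blocked (firewall, routing, SELinux), independent of Open MPI.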

-jms
Sent from my PDA.  No type good.

- Original Message -
From: users-boun...@open-mpi.org 
To: Open MPI Users 
Sent: Wed Mar 31 06:25:43 2010
Subject: Re: [OMPI users] Problem in remote nodes

I've been checking /var/log/messages on the compute node and there is
nothing new after executing 'mpirun --host itanium2 -np 2 helloworld.out',
but the following messages appear in /var/log/messages on the remote node;
there is nothing about unix_chkpwd.

Mar 31 11:56:51 itanium2 sshd(pam_unix)[15349]: authentication failure;
logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=itanium1  user=otro
Mar 31 11:56:53 itanium2 sshd[15349]: Accepted publickey for otro from
192.168.3.1 port 40999 ssh2
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session opened for user
otro by (uid=500)
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session closed for user otro

It seems that the authentication fails at first, but in the next message
it connects with the node...

On Tue, Mar 30, 2010, at 20:02, Robert Collyer wrote:
> I've been having similar problems using Fedora core 9.  I believe the
> issue may be with SELinux, but this is just an educated guess.  In my
> setup, shortly after a login via mpi, there is a notation in the
> /var/log/messages on the compute node as follows:
>
> Mar 30 12:39:45  kernel: type=1400 audit(1269970785.534:588):
> avc:  denied  { read } for  pid=8047 comm="unix_chkpwd" name="hosts"
> dev=dm-0 ino=24579
> scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file
>
> which says SELinux denied unix_chkpwd read access to hosts.
>
> Are you getting anything like this?
>
> In the meantime, I'll check if allowing unix_chkpwd read access to hosts
> eliminates the problem on my system, and if it works, I'll post the
> steps involved.
>
> uriz.49...@e.unavarra.es wrote:
>> I've been investigating and there is no firewall that could stop TCP
>> traffic in the cluster. With the option --mca plm_base_verbose 30 I get
>> the following output:
>>
>> [itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2
>> helloworld.out
>> [itanium1:08311] mca: base: components_open: Looking for plm components
>> [itanium1:08311] mca: base: components_open: opening plm components
>> [itanium1:08311] mca: base: components_open: found loaded component rsh
>> [itanium1:08311] mca: base: components_open: component rsh has no
>> register
>> function
>> [itanium1:08311] mca: base: components_open: component rsh open function
>> successful
>> [itanium1:08311] mca: base: components_open: found loaded component
>> slurm
>> [itanium1:08311] mca: base: components_open: component slurm has no
>> register function
>> [itanium1:08311] mca: base: components_open: component slurm open
>> function
>> successful
>> [itanium1:08311] mca:base:select: Auto-selecting plm components
>> [itanium1:08311] mca:base:select:(  plm) Querying component [rsh]
>> [itanium1:08311] mca:base:select:(  plm) Query of component [rsh] set
>> priority to 10
>> [itanium1:08311] mca:base:select:(  plm) Querying component [slurm]
>> [itanium1:08311] mca:base:select:(  plm) Skipping component [slurm].
>> Query
>> failed to return a module
>> [itanium1:08311] mca:base:select:(  plm) Selected component [rsh]
>> [itanium1:08311] mca: base: close: component slurm closed
>> [itanium1:08311] mca: base: close: unloading component slurm
>>
>> --Hangs here
>>
>> Could it be a slurm problem?
>>
>> Thanks for any ideas.
>>
>> On Fri, Mar 19, 2010, at 17:57, Ralph Castain wrote:
>>
>>> Did you configure OMPI with --enable-debug? You should do this so that
>>> more diagnostic output is available.
>>>
>>> You can also add the following to your cmd line to get more info:
>>>
>>> --debug --debug-daemons --leave-session-attached
>>>
>>> Something is likely blocking proper launch of the daemons and processes
>>> so
>>> you aren't getting to the btl's at all.
>>>
>>>
>>> On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:
>>>
>>>
>>>> The processes 

Re: [OMPI users] Problem in remote nodes

2010-03-31 Thread uriz . 49949
I've been checking /var/log/messages on the compute node and there is
nothing new after executing 'mpirun --host itanium2 -np 2 helloworld.out',
but the following messages appear in /var/log/messages on the remote node;
there is nothing about unix_chkpwd.

Mar 31 11:56:51 itanium2 sshd(pam_unix)[15349]: authentication failure;
logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=itanium1  user=otro
Mar 31 11:56:53 itanium2 sshd[15349]: Accepted publickey for otro from
192.168.3.1 port 40999 ssh2
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session opened for user
otro by (uid=500)
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session closed for user otro

It seems that the authentication fails at first, but in the next message
it connects with the node...

On Tue, Mar 30, 2010, at 20:02, Robert Collyer wrote:
> I've been having similar problems using Fedora core 9.  I believe the
> issue may be with SELinux, but this is just an educated guess.  In my
> setup, shortly after a login via mpi, there is a notation in the
> /var/log/messages on the compute node as follows:
>
> Mar 30 12:39:45  kernel: type=1400 audit(1269970785.534:588):
> avc:  denied  { read } for  pid=8047 comm="unix_chkpwd" name="hosts"
> dev=dm-0 ino=24579
> scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file
>
> which says SELinux denied unix_chkpwd read access to hosts.
>
> Are you getting anything like this?
>
> In the meantime, I'll check if allowing unix_chkpwd read access to hosts
> eliminates the problem on my system, and if it works, I'll post the
> steps involved.
>
> uriz.49...@e.unavarra.es wrote:
>> I've been investigating and there is no firewall that could stop TCP
>> traffic in the cluster. With the option --mca plm_base_verbose 30 I get
>> the following output:
>>
>> [itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2
>> helloworld.out
>> [itanium1:08311] mca: base: components_open: Looking for plm components
>> [itanium1:08311] mca: base: components_open: opening plm components
>> [itanium1:08311] mca: base: components_open: found loaded component rsh
>> [itanium1:08311] mca: base: components_open: component rsh has no
>> register
>> function
>> [itanium1:08311] mca: base: components_open: component rsh open function
>> successful
>> [itanium1:08311] mca: base: components_open: found loaded component
>> slurm
>> [itanium1:08311] mca: base: components_open: component slurm has no
>> register function
>> [itanium1:08311] mca: base: components_open: component slurm open
>> function
>> successful
>> [itanium1:08311] mca:base:select: Auto-selecting plm components
>> [itanium1:08311] mca:base:select:(  plm) Querying component [rsh]
>> [itanium1:08311] mca:base:select:(  plm) Query of component [rsh] set
>> priority to 10
>> [itanium1:08311] mca:base:select:(  plm) Querying component [slurm]
>> [itanium1:08311] mca:base:select:(  plm) Skipping component [slurm].
>> Query
>> failed to return a module
>> [itanium1:08311] mca:base:select:(  plm) Selected component [rsh]
>> [itanium1:08311] mca: base: close: component slurm closed
>> [itanium1:08311] mca: base: close: unloading component slurm
>>
>> --Hangs here
>>
>> Could it be a slurm problem?
>>
>> Thanks for any ideas.
>>
>> On Fri, Mar 19, 2010, at 17:57, Ralph Castain wrote:
>>
>>> Did you configure OMPI with --enable-debug? You should do this so that
>>> more diagnostic output is available.
>>>
>>> You can also add the following to your cmd line to get more info:
>>>
>>> --debug --debug-daemons --leave-session-attached
>>>
>>> Something is likely blocking proper launch of the daemons and processes
>>> so
>>> you aren't getting to the btl's at all.
>>>
>>>
>>> On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:
>>>
>>>
 The processes are running on the remote nodes but they don't give the
 response to the origin node. I don't know why.
 With the option --mca btl_base_verbose 30, I have the same problems and it
 doesn't show any message.

 Thanks


> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres 
> wrote:
>
>> On Mar 17, 2010, at 4:39 AM,  wrote:
>>
>>
>>> Hi everyone, I'm a new Open MPI user and I have just installed Open MPI
>>> in a 6-node cluster with Scientific Linux. When I execute it locally it
>>> works perfectly, but when I try to execute it on the remote nodes with
>>> the --host option it hangs and gives no message. I think that the
>>> problem could be with the shared libraries, but I'm not sure. In my
>>> opinion the problem is not ssh, because I can access the nodes with no
>>> password.
>>>
>> You might want to check that Open MPI processes are actually running
>> on
>> the remote nodes -- check with ps if you see any "orted" or other
>> MPI-related processes (e.g., your processes)

Re: [OMPI users] Problem in remote nodes

2010-03-30 Thread Robert Collyer
I changed the SELinux config to permissive (log only), and it didn't 
change anything.  Back to the drawing board.


Robert Collyer wrote:
I've been having similar problems using Fedora core 9.  I believe the 
issue may be with SELinux, but this is just an educated guess.  In my 
setup, shortly after a login via mpi, there is a notation in the 
/var/log/messages on the compute node as follows:


Mar 30 12:39:45  kernel: type=1400 
audit(1269970785.534:588): avc:  denied  { read } for  pid=8047 
comm="unix_chkpwd" name="hosts" dev=dm-0 ino=24579 
scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023 
tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file


which says SELinux denied unix_chkpwd read access to hosts.
Are you getting anything like this?

In the meantime, I'll check if allowing unix_chkpwd read access to 
hosts eliminates the problem on my system, and if it works, I'll post 
the steps involved.


uriz.49...@e.unavarra.es wrote:

I've been investigating and there is no firewall that could stop TCP
traffic in the cluster. With the option --mca plm_base_verbose 30 I get
the following output:

[itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2
helloworld.out
[itanium1:08311] mca: base: components_open: Looking for plm components
[itanium1:08311] mca: base: components_open: opening plm components
[itanium1:08311] mca: base: components_open: found loaded component rsh
[itanium1:08311] mca: base: components_open: component rsh has no register function
[itanium1:08311] mca: base: components_open: component rsh open function successful
[itanium1:08311] mca: base: components_open: found loaded component slurm
[itanium1:08311] mca: base: components_open: component slurm has no register function
[itanium1:08311] mca: base: components_open: component slurm open function successful
[itanium1:08311] mca:base:select: Auto-selecting plm components
[itanium1:08311] mca:base:select:(  plm) Querying component [rsh]
[itanium1:08311] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[itanium1:08311] mca:base:select:(  plm) Querying component [slurm]
[itanium1:08311] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[itanium1:08311] mca:base:select:(  plm) Selected component [rsh]
[itanium1:08311] mca: base: close: component slurm closed
[itanium1:08311] mca: base: close: unloading component slurm

--Hangs here

Could it be a slurm problem?

Thanks for any ideas.

On Fri, Mar 19, 2010, at 17:57, Ralph Castain wrote:
 

Did you configure OMPI with --enable-debug? You should do this so that
more diagnostic output is available.

You can also add the following to your cmd line to get more info:

--debug --debug-daemons --leave-session-attached

Something is likely blocking proper launch of the daemons and processes so
you aren't getting to the btl's at all.


On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:

   

The processes are running on the remote nodes but they don't give the
response to the origin node. I don't know why.
With the option --mca btl_base_verbose 30, I have the same problems and it
doesn't show any message.

Thanks

 

On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres 
wrote:
   

On Mar 17, 2010, at 4:39 AM,  wrote:

 

Hi everyone, I'm a new Open MPI user and I have just installed Open MPI
in a 6-node cluster with Scientific Linux. When I execute it locally it
works perfectly, but when I try to execute it on the remote nodes with
the --host option it hangs and gives no message. I think that the
problem could be with the shared libraries, but I'm not sure. In my
opinion the problem is not ssh, because I can access the nodes with no
password.


You might want to check that Open MPI processes are actually running
on
the remote nodes -- check with ps if you see any "orted" or other
MPI-related processes (e.g., your processes).

Do you have any TCP firewall software running between the nodes?  If
so,
you'll need to disable it (at least for Open MPI jobs).
  
I also recommend running mpirun with the option --mca btl_base_verbose
30 to troubleshoot tcp issues.

In some environments, you need to explicitly tell mpirun what network
interfaces it can use to reach the hosts. Read the following FAQ
section for more information:

http://www.open-mpi.org/faq/?category=tcp

Item 7 of the FAQ might be of special interest.

Regards,


Re: [OMPI users] Problem in remote nodes

2010-03-30 Thread Robert Collyer
I've been having similar problems using Fedora core 9.  I believe the 
issue may be with SELinux, but this is just an educated guess.  In my 
setup, shortly after a login via mpi, there is a notation in the 
/var/log/messages on the compute node as follows:


Mar 30 12:39:45  kernel: type=1400 audit(1269970785.534:588): 
avc:  denied  { read } for  pid=8047 comm="unix_chkpwd" name="hosts" 
dev=dm-0 ino=24579 
scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023 
tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file


which says SELinux denied unix_chkpwd read access to hosts. 


Are you getting anything like this?

In the meantime, I'll check if allowing unix_chkpwd read access to hosts 
eliminates the problem on my system, and if it works, I'll post the 
steps involved.
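
In case it is useful, a sketch of the kind of steps that would be involved,
assuming the standard SELinux tools (policycoreutils) are installed; the
module name local_chkpwd is just illustrative:

  # Check the current SELinux mode, and try permissive temporarily
  getenforce
  setenforce 0

  # Or build a local policy module from the recorded denial
  grep unix_chkpwd /var/log/audit/audit.log | audit2allow -M local_chkpwd
  semodule -i local_chkpwd.pp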


uriz.49...@e.unavarra.es wrote:

I've been investigating and there is no firewall that could stop TCP
traffic in the cluster. With the option --mca plm_base_verbose 30 I get
the following output:

[itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2
helloworld.out
[itanium1:08311] mca: base: components_open: Looking for plm components
[itanium1:08311] mca: base: components_open: opening plm components
[itanium1:08311] mca: base: components_open: found loaded component rsh
[itanium1:08311] mca: base: components_open: component rsh has no register
function
[itanium1:08311] mca: base: components_open: component rsh open function
successful
[itanium1:08311] mca: base: components_open: found loaded component slurm
[itanium1:08311] mca: base: components_open: component slurm has no
register function
[itanium1:08311] mca: base: components_open: component slurm open function
successful
[itanium1:08311] mca:base:select: Auto-selecting plm components
[itanium1:08311] mca:base:select:(  plm) Querying component [rsh]
[itanium1:08311] mca:base:select:(  plm) Query of component [rsh] set
priority to 10
[itanium1:08311] mca:base:select:(  plm) Querying component [slurm]
[itanium1:08311] mca:base:select:(  plm) Skipping component [slurm]. Query
failed to return a module
[itanium1:08311] mca:base:select:(  plm) Selected component [rsh]
[itanium1:08311] mca: base: close: component slurm closed
[itanium1:08311] mca: base: close: unloading component slurm

--Hangs here

Could it be a slurm problem?

Thanks for any ideas.

On Fri, Mar 19, 2010, at 17:57, Ralph Castain wrote:
  

Did you configure OMPI with --enable-debug? You should do this so that
more diagnostic output is available.

You can also add the following to your cmd line to get more info:

--debug --debug-daemons --leave-session-attached

Something is likely blocking proper launch of the daemons and processes so
you aren't getting to the btl's at all.


On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:



The processes are running on the remote nodes but they don't give the
response to the origin node. I don't know why.
With the option --mca btl_base_verbose 30, I have the same problems and it
doesn't show any message.

Thanks

  

On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres 
wrote:


On Mar 17, 2010, at 4:39 AM,  wrote:

  

Hi everyone, I'm a new Open MPI user and I have just installed Open MPI
in a 6-node cluster with Scientific Linux. When I execute it locally it
works perfectly, but when I try to execute it on the remote nodes with
the --host option it hangs and gives no message. I think that the
problem could be with the shared libraries, but I'm not sure. In my
opinion the problem is not ssh, because I can access the nodes with no
password.


You might want to check that Open MPI processes are actually running
on
the remote nodes -- check with ps if you see any "orted" or other
MPI-related processes (e.g., your processes).

Do you have any TCP firewall software running between the nodes?  If
so,
you'll need to disable it (at least for Open MPI jobs).
  

I also recommend running mpirun with the option --mca btl_base_verbose
30 to troubleshoot tcp issues.

In some environments, you need to explicitly tell mpirun what network
interfaces it can use to reach the hosts. Read the following FAQ
section for more information:

http://www.open-mpi.org/faq/?category=tcp

Item 7 of the FAQ might be of special interest.

Regards,



Re: [OMPI users] Problem in remote nodes

2010-03-30 Thread Ralph Castain
Looks to me like you have an error in your cmd line - you aren't specifying the 
number of procs to run. My guess is that the system is hanging trying to 
resolve the process map as a result. Try adding "-np 1" to the cmd line.
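
For example, keeping the rest of the command line used earlier in the thread
(a sketch):

  mpirun -np 1 --mca plm_base_verbose 30 --host itanium2 helloworld.out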

The output indicates it is dropping slurm because it doesn't see a slurm 
allocation. So it is defaulting to use of rsh/ssh to launch.


On Mar 30, 2010, at 4:27 AM, uriz.49...@e.unavarra.es wrote:

> I've been investigating and there is no firewall that could stop TCP
> traffic in the cluster. With the option --mca plm_base_verbose 30 I get
> the following output:
> 
> [itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2
> helloworld.out
> [itanium1:08311] mca: base: components_open: Looking for plm components
> [itanium1:08311] mca: base: components_open: opening plm components
> [itanium1:08311] mca: base: components_open: found loaded component rsh
> [itanium1:08311] mca: base: components_open: component rsh has no register
> function
> [itanium1:08311] mca: base: components_open: component rsh open function
> successful
> [itanium1:08311] mca: base: components_open: found loaded component slurm
> [itanium1:08311] mca: base: components_open: component slurm has no
> register function
> [itanium1:08311] mca: base: components_open: component slurm open function
> successful
> [itanium1:08311] mca:base:select: Auto-selecting plm components
> [itanium1:08311] mca:base:select:(  plm) Querying component [rsh]
> [itanium1:08311] mca:base:select:(  plm) Query of component [rsh] set
> priority to 10
> [itanium1:08311] mca:base:select:(  plm) Querying component [slurm]
> [itanium1:08311] mca:base:select:(  plm) Skipping component [slurm]. Query
> failed to return a module
> [itanium1:08311] mca:base:select:(  plm) Selected component [rsh]
> [itanium1:08311] mca: base: close: component slurm closed
> [itanium1:08311] mca: base: close: unloading component slurm
> 
> --Hangs here
> 
> Could it be a slurm problem?
> 
> Thanks for any ideas.
> 
> On Fri, Mar 19, 2010, at 17:57, Ralph Castain wrote:
>> Did you configure OMPI with --enable-debug? You should do this so that
>> more diagnostic output is available.
>> 
>> You can also add the following to your cmd line to get more info:
>> 
>> --debug --debug-daemons --leave-session-attached
>> 
>> Something is likely blocking proper launch of the daemons and processes so
>> you aren't getting to the btl's at all.
>> 
>> 
>> On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:
>> 
>>> The processes are running on the remote nodes but they don't give the
>>> response to the origin node. I don't know why.
>>> With the option --mca btl_base_verbose 30, I have the same problems and
>>> it
>>> doesn't show any message.
>>> 
>>> Thanks
>>> 
 On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres 
 wrote:
> On Mar 17, 2010, at 4:39 AM,  wrote:
> 
>> Hi everyone, I'm a new Open MPI user and I have just installed Open MPI
>> in a 6-node cluster with Scientific Linux. When I execute it locally it
>> works perfectly, but when I try to execute it on the remote nodes with
>> the --host option it hangs and gives no message. I think that the
>> problem could be with the shared libraries, but I'm not sure. In my
>> opinion the problem is not ssh, because I can access the nodes with no
>> password.
> 
> You might want to check that Open MPI processes are actually running
> on
> the remote nodes -- check with ps if you see any "orted" or other
> MPI-related processes (e.g., your processes).
> 
> Do you have any TCP firewall software running between the nodes?  If
> so,
> you'll need to disable it (at least for Open MPI jobs).
 
 I also recommend running mpirun with the option --mca btl_base_verbose
 30 to troubleshoot tcp issues.
 
 In some environments, you need to explicitly tell mpirun what network
 interfaces it can use to reach the hosts. Read the following FAQ
 section for more information:
 
 http://www.open-mpi.org/faq/?category=tcp
 
 Item 7 of the FAQ might be of special interest.
 
 Regards,
 




Re: [OMPI users] Problem in remote nodes

2010-03-30 Thread uriz . 49949
I've been investigating and there is no firewall that could stop TCP
traffic in the cluster. With the option --mca plm_base_verbose 30 I get
the following output:

[itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2
helloworld.out
[itanium1:08311] mca: base: components_open: Looking for plm components
[itanium1:08311] mca: base: components_open: opening plm components
[itanium1:08311] mca: base: components_open: found loaded component rsh
[itanium1:08311] mca: base: components_open: component rsh has no register
function
[itanium1:08311] mca: base: components_open: component rsh open function
successful
[itanium1:08311] mca: base: components_open: found loaded component slurm
[itanium1:08311] mca: base: components_open: component slurm has no
register function
[itanium1:08311] mca: base: components_open: component slurm open function
successful
[itanium1:08311] mca:base:select: Auto-selecting plm components
[itanium1:08311] mca:base:select:(  plm) Querying component [rsh]
[itanium1:08311] mca:base:select:(  plm) Query of component [rsh] set
priority to 10
[itanium1:08311] mca:base:select:(  plm) Querying component [slurm]
[itanium1:08311] mca:base:select:(  plm) Skipping component [slurm]. Query
failed to return a module
[itanium1:08311] mca:base:select:(  plm) Selected component [rsh]
[itanium1:08311] mca: base: close: component slurm closed
[itanium1:08311] mca: base: close: unloading component slurm

--Hangs here

Could it be a slurm problem?

Thanks for any ideas.

On Fri, Mar 19, 2010, at 17:57, Ralph Castain wrote:
> Did you configure OMPI with --enable-debug? You should do this so that
> more diagnostic output is available.
>
> You can also add the following to your cmd line to get more info:
>
> --debug --debug-daemons --leave-session-attached
>
> Something is likely blocking proper launch of the daemons and processes so
> you aren't getting to the btl's at all.
>
>
> On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:
>
>> The processes are running on the remote nodes but they don't give the
>> response to the origin node. I don't know why.
>> With the option --mca btl_base_verbose 30, I have the same problems and
>> it
>> doesn't show any message.
>>
>> Thanks
>>
>>> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres 
>>> wrote:
 On Mar 17, 2010, at 4:39 AM,  wrote:

> Hi everyone, I'm a new Open MPI user and I have just installed Open MPI
> in a 6-node cluster with Scientific Linux. When I execute it locally it
> works perfectly, but when I try to execute it on the remote nodes with
> the --host option it hangs and gives no message. I think that the
> problem could be with the shared libraries, but I'm not sure. In my
> opinion the problem is not ssh, because I can access the nodes with no
> password.

 You might want to check that Open MPI processes are actually running
 on
 the remote nodes -- check with ps if you see any "orted" or other
 MPI-related processes (e.g., your processes).

 Do you have any TCP firewall software running between the nodes?  If
 so,
 you'll need to disable it (at least for Open MPI jobs).
>>>
>>> I also recommend running mpirun with the option --mca btl_base_verbose
>>> 30 to troubleshoot tcp issues.
>>>
>>> In some environments, you need to explicitly tell mpirun what network
>>> interfaces it can use to reach the hosts. Read the following FAQ
>>> section for more information:
>>>
>>> http://www.open-mpi.org/faq/?category=tcp
>>>
>>> Item 7 of the FAQ might be of special interest.
>>>
>>> Regards,
>>>




Re: [OMPI users] Problem in remote nodes

2010-03-19 Thread Ralph Castain
Did you configure OMPI with --enable-debug? You should do this so that more 
diagnostic output is available.

You can also add the following to your cmd line to get more info:

--debug --debug-daemons --leave-session-attached

Something is likely blocking proper launch of the daemons and processes so you 
aren't getting to the btl's at all.
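
For example, a sketch of how that might look with the command used elsewhere
in the thread (the install prefix is illustrative; --debug-daemons and
--leave-session-attached are the options that usually matter for launch
problems):

  # Rebuild Open MPI with debug support
  ./configure --enable-debug --prefix=/opt/openmpi-debug && make && make install

  # Re-run the test, keeping daemon diagnostics attached to the terminal
  mpirun --debug-daemons --leave-session-attached --host itanium2 -np 2 helloworld.out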


On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:

> The processes are running on the remote nodes but they don't give the
> response to the origin node. I don't know why.
> With the option --mca btl_base_verbose 30, I have the same problems and it
> doesn't show any message.
> 
> Thanks
> 
>> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres  wrote:
>>> On Mar 17, 2010, at 4:39 AM,  wrote:
>>> 
 Hi everyone, I'm a new Open MPI user and I have just installed Open MPI
 in a 6-node cluster with Scientific Linux. When I execute it locally it
 works perfectly, but when I try to execute it on the remote nodes with
 the --host option it hangs and gives no message. I think that the problem
 could be with the shared libraries, but I'm not sure. In my opinion the
 problem is not ssh, because I can access the nodes with no password.
>>> 
>>> You might want to check that Open MPI processes are actually running on
>>> the remote nodes -- check with ps if you see any "orted" or other
>>> MPI-related processes (e.g., your processes).
>>> 
>>> Do you have any TCP firewall software running between the nodes?  If so,
>>> you'll need to disable it (at least for Open MPI jobs).
>> 
>> I also recommend running mpirun with the option --mca btl_base_verbose
>> 30 to troubleshoot tcp issues.
>> 
>> In some environments, you need to explicitly tell mpirun what network
>> interfaces it can use to reach the hosts. Read the following FAQ
>> section for more information:
>> 
>> http://www.open-mpi.org/faq/?category=tcp
>> 
>> Item 7 of the FAQ might be of special interest.
>> 
>> Regards,
>> 




Re: [OMPI users] Problem in remote nodes

2010-03-19 Thread uriz . 49949
The processes are running on the remote nodes but they don't give the
response to the origin node. I don't know why.
With the option --mca btl_base_verbose 30, I have the same problems and it
doesn't show any message.

Thanks

> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres  wrote:
>> On Mar 17, 2010, at 4:39 AM,  wrote:
>>
>>> Hi everyone, I'm a new Open MPI user and I have just installed Open MPI
>>> in a 6-node cluster with Scientific Linux. When I execute it locally it
>>> works perfectly, but when I try to execute it on the remote nodes with
>>> the --host option it hangs and gives no message. I think that the problem
>>> could be with the shared libraries, but I'm not sure. In my opinion the
>>> problem is not ssh, because I can access the nodes with no password.
>>
>> You might want to check that Open MPI processes are actually running on
>> the remote nodes -- check with ps if you see any "orted" or other
>> MPI-related processes (e.g., your processes).
>>
>> Do you have any TCP firewall software running between the nodes?  If so,
>> you'll need to disable it (at least for Open MPI jobs).
>
> I also recommend running mpirun with the option --mca btl_base_verbose
> 30 to troubleshoot tcp issues.
>
> In some environments, you need to explicitly tell mpirun what network
> interfaces it can use to reach the hosts. Read the following FAQ
> section for more information:
>
> http://www.open-mpi.org/faq/?category=tcp
>
> Item 7 of the FAQ might be of special interest.
>
> Regards,
>




Re: [OMPI users] Problem in remote nodes

2010-03-17 Thread Fernando Lemos
On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres  wrote:
> On Mar 17, 2010, at 4:39 AM,  wrote:
>
>> Hi everyone, I'm a new Open MPI user and I have just installed Open MPI in
>> a 6-node cluster with Scientific Linux. When I execute it locally it works
>> perfectly, but when I try to execute it on the remote nodes with the --host
>> option it hangs and gives no message. I think that the problem could be
>> with the shared libraries, but I'm not sure. In my opinion the problem is
>> not ssh, because I can access the nodes with no password.
>
> You might want to check that Open MPI processes are actually running on the 
> remote nodes -- check with ps if you see any "orted" or other MPI-related 
> processes (e.g., your processes).
>
> Do you have any TCP firewall software running between the nodes?  If so, 
> you'll need to disable it (at least for Open MPI jobs).

I also recommend running mpirun with the option --mca btl_base_verbose
30 to troubleshoot tcp issues.

In some environments, you need to explicitly tell mpirun what network
interfaces it can use to reach the hosts. Read the following FAQ
section for more information:

http://www.open-mpi.org/faq/?category=tcp

Item 7 of the FAQ might be of special interest.
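
For instance, a sketch of pinning Open MPI to a specific interface, assuming
the nodes reach each other over a NIC named eth0 (substitute your actual
interface name):

  mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 \
         --host itanium2 -np 2 helloworld.out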

Regards,



Re: [OMPI users] Problem in remote nodes

2010-03-17 Thread Jeff Squyres
On Mar 17, 2010, at 4:39 AM,  wrote:

> Hi everyone, I'm a new Open MPI user and I have just installed Open MPI in
> a 6-node cluster with Scientific Linux. When I execute it locally it works
> perfectly, but when I try to execute it on the remote nodes with the --host
> option it hangs and gives no message. I think that the problem could be
> with the shared libraries, but I'm not sure. In my opinion the problem is
> not ssh, because I can access the nodes with no password.

You might want to check that Open MPI processes are actually running on the 
remote nodes -- check with ps if you see any "orted" or other MPI-related 
processes (e.g., your processes).
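
For example (a sketch, relying on the passwordless ssh you already have set up):

  # From the node running mpirun, look for daemons or your application
  # on the remote node
  ssh itanium2 'ps -ef | grep -E "orted|helloworld" | grep -v grep'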

Do you have any TCP firewall software running between the nodes?  If so, you'll 
need to disable it (at least for Open MPI jobs).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Problem in remote nodes

2010-03-17 Thread uriz . 49949
Hi everyone, I'm a new Open MPI user and I have just installed Open MPI in
a 6-node cluster with Scientific Linux. When I execute it locally it works
perfectly, but when I try to execute it on the remote nodes with the --host
option it hangs and gives no message. I think that the problem could be with
the shared libraries, but I'm not sure. In my opinion the problem is not ssh,
because I can access the nodes with no password.

If someone could give me an idea of what my problem could be, I'd be very
pleased... I'm totally blocked.

thanks