Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-14 Thread Robert Walters
Also, I finally got some graphical output from Sun Studio Analyzer.

I see MPI_Recv and MPI_Wait taking a lot of time, but I would think that
is okay: this program does heavy number crunching, and I would expect it to
need to wait (or wait to receive) fairly often, since there is a decent
amount of time between communications. Is this a correct assumption?

What does catch my eye is that MPI_Barrier takes up a significant chunk of around
10%. I read that MPI_Barrier blocks the caller until all processes have
called it? Perhaps something fishy is going on there, with some processes taking an
awfully long time to reach the barrier. On the other hand, MPI_Send is
not taking very long, which makes me feel more comfortable about the network
communication.
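
A barrier that eats roughly 10% of the run usually means some ranks reach it much
later than the others, i.e. load imbalance rather than a slow network. A minimal
sketch of how that can be checked, in illustrative C rather than anything from
LS-DYNA (all names here are made up):

/* Toy check: time the barrier itself on every rank with MPI_Wtime.
 * Ranks that finish their work early report a long wait; the slowest
 * rank reports almost none.  A large spread across ranks points to
 * load imbalance rather than a slow network. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... each rank does its share of number crunching here ... */

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* blocks until every rank has arrived */
    double waited = MPI_Wtime() - t0;

    printf("rank %d waited %.3f s in MPI_Barrier\n", rank, waited);

    MPI_Finalize();
    return 0;
}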

Anyways, please have a look and let me know what 
you think could be the issue.

Regards,
Robert Walters

--- On Tue, 7/13/10, Robert Walters  wrote:

From: Robert Walters 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 13, 2010, 10:42 PM

Naturally, a forgotten attachment.

And to edit that: it was compiled to be used with OpenMPI 1.4.1, but as I
understand it, 1.4.2 is just a bug-fix release of 1.4.1.

--- On Tue, 7/13/10, Robert Walters  wrote:

From: Robert Walters 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 13, 2010, 10:38 PM

I think I forgot to mention earlier that the application I am using is 
pre-compiled. It is finite element software called LS-DYNA. It is not open 
source, and I likely cannot obtain the code it uses for MPP. The version I am 
using was specifically compiled, by the parent company, for OpenMPI 1.4.2 MPP 
operations. 

I recently installed Sun Studio 12.1 to attempt to analyze the situation. 
It seems to work partially: it will record the various processes individually, 
which is cryptic to read. The function it fails on, though, is the MPI tracing. 
It errors that "no MPI tracing data file in experiment, MPI Timeline and MPI 
Charts will not be available". Sometime during the analysis (about 10,000 
iterations later), it complains that there have been too many I/O flushes 
(VT_MAX_FLUSHES) and it's not happy. I've increased that number via the 
environment variable and killed the analysis before it had a chance to error, 
but still no MPI trace data is recorded. Not sure if you guys have heard of 
that happening or know any way to fix it... Did OpenMPI need to be 
configured/built for Sun Studio use?
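
For reference, the flush limit mentioned above is a VampirTrace environment
variable; a minimal sketch of the two settings usually involved (VT_MAX_FLUSHES
is the one named above, VT_BUFFER_SIZE is an assumption about the VampirTrace
bundled with this toolchain):

VT_MAX_FLUSHES=0     (0 removes the limit on trace-buffer flushes)
VT_BUFFER_SIZE=64M   (a larger in-memory trace buffer, so fewer flushes are needed)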

I also noticed that from the data I do get back, there are two sets of 
functions for everything. There is mpi_recv and then my_recv_, both with the 
same % utilization time. The mpi one comes from your program's library and the 
my_recv_ one comes from my program. Is that typical or should the program I'm 
using be saying mpi_recv only? This data may be enough to help me see what's 
wrong so I will pass it along. Keep in mind this is percent time of total run 
time and not percent of MPI communication. I attached the information in a 
picture rather than attempting to format a nice table in this nasty e-mail 
application.
I blacked out items that are related to LS-DYNA, but afterward I realized that 
every function with an _ at the end probably represents a command issuing from 
LS-DYNA. 

These are my big spenders. The processes I did not include are in the bottom 
4%. The processes that would be above these were the LS-DYNA applications at 
100%. Like I mentioned earlier, there are two instances of every MPI command, 
and they carry the same percent usage. It's curious that this version, built 
for OpenMPI, uses different functions. 
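
One plausible reading of the doubled entries (an assumption, since the LS-DYNA
source is not available): the names ending in an underscore look like
Fortran-style symbols from the application, thin wrappers that do little more
than call the real MPI routine, so a profiler reporting inclusive time shows
both names with essentially the same percentage. A hypothetical sketch in C of
what such a wrapper could look like (my_recv_ is just the symbol from the
profile; the body is invented):

#include <mpi.h>

/* Hypothetical application-side wrapper: the Fortran-style symbol my_recv_
 * simply forwards to MPI_Recv, so nearly all of its time is spent inside
 * the MPI call and both names show up with the same share of the run. */
void my_recv_(void *buf, int *count, int *source, int *tag, int *ierr)
{
    MPI_Status status;
    *ierr = MPI_Recv(buf, *count, MPI_DOUBLE, *source, *tag,
                     MPI_COMM_WORLD, &status);
}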

Just for a little more background info: OpenMPI is being launched from a local 
hard drive on each machine, but the LS-DYNA job files and related data output 
files are on a drive mounted on that machine from a different machine that is 
also in the cluster. We were thinking that might be an issue, but it isn't 
writing enough data for me to think it would significantly decrease MPP 
performance.

One last thing I would like to mention: OpenMPI running 8 cores on a single 
node, with all of the communication, works flawlessly. It works much faster 
than the Shared Memory Parallel (SMP) version of LS-DYNA that we currently 
use, scaled to 8 cores. LS-DYNA seems to be approximately 25% faster (don't 
quote me on that) with the OpenMPI installation than with the standard SMP, 
which is awesome. My point is that OpenMPI seems to be working fine, even with 
the screwy mounted drive. This leads me to continue to point at the network.

Anyhow, let me know if anything seems weird on the OpenMPI communication 
subroutines. I don't have any numbers to lean on from experience.

Sorry this e-mail was long. Thank you again for all of your help.

Regards,
Robert Walters

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-13 Thread Robert Walters
Naturally, a forgotten attachment.

And to edit that: it was compiled to be used with OpenMPI 1.4.1, but as I 
understand it, 1.4.2 is just a bug-fix release of 1.4.1.

--- On Tue, 7/13/10, Robert Walters  wrote:

From: Robert Walters 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 13, 2010, 10:38 PM

I think I forgot to mention earlier that the application I am using is 
pre-compiled. It is a finite element software called LS-DYNA. It is not open 
source and I likely cannot obtain the code it uses for MPP. This version I am 
using was specifically compiled, by the parent company, for OpenMPI 1.4.2 MPP 
operations. 

I recently installed Sun Studio 12.1 to attempt to analyze the situation. 
It seems to work partially: it will record the various processes individually, 
which is cryptic to read. The function it fails on, though, is the MPI tracing. 
It errors that "no MPI tracing data file in experiment, MPI Timeline and MPI 
Charts will not be available". Sometime during the analysis (about 10,000 
iterations later), it complains that there have been too many I/O flushes 
(VT_MAX_FLUSHES) and it's not happy. I've increased that number via the 
environment variable and killed the analysis before it had a chance to error, 
but still no MPI trace data is recorded. Not sure if you guys have heard of 
that happening or know any way to fix it... Did OpenMPI need to be 
configured/built for Sun Studio use?

I also noticed that from the data I do get back, there are two sets of 
functions for everything. There is mpi_recv and then my_recv_, both with the 
same % utilization time. The mpi one comes from your program's library and the 
my_recv_ one comes from my program. Is that typical or should the program I'm 
using be saying mpi_recv only? This data may be enough to help me see what's 
wrong so I will pass it along. Keep in mind this is percent time of total run 
time and not percent of MPI communication. I attached the information in a 
picture rather than me attempting to format a nice table in this nasty e-mail 
application.
I blacked out items that are related to LS-DYNA but afterward I just realized 
that I think every function with
 an _ at the end represents a command issuing from LS-DYNA. 

These are my big spenders. The processes I did not include are in the bottom 
4%. The processes that would be above these were the LS-DYNA applications at 
100%. Like I mentioned earlier, there are two instances of every MPI command, 
and they carry the same percent usage. It's curious that this version, built 
for OpenMPI, uses different functions. 

Just for a little more background info, OpenMPI is being launched from a local 
hard drive on each machine, but the LS-DYNA job files, and related data output 
files, are on a mounted drive on that machine, where the mounted drive is 
located on a different machine also in the cluster. We were thinking that might 
be an issue but it isn't writing enough data for me to think that would 
significantly decrease MPP performance.

I would like to make one last mention. That is that OpenMPI running 8 cores on 
a single node, with all the
 communication, works flawlessly. It works much faster than the Shared Memory 
Parallel (SMP) version of LS-DYNA that we currently have used scaled to 8 
cores. LS-DYNA seems to be approximately 25% faster (don't quote me on that) 
when using the OpenMPI installation than when using the standard SMP, which is 
awesome. My point being that OpenMPI seems to be working fine, even with the 
screwy mounted drive. This leads me to continue to point at the network.

Anyhow, let me know if anything seems weird on the OpenMPI communication 
subroutines. I don't have any numbers to lean on from experience.

Sorry this e-mail was long. Thank you again for all of your help.

Regards,
Robert Walters

--- On Tue, 7/13/10, David Zhang  wrote:

From: David Zhang 
Subject:
 Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 13, 2010, 9:42 PM

Like Ralph says, the slow down may not be coming from the kernel, but rather on 
waiting for messages.  What MPI send/recv commands are you using?

On Tue, Jul 13, 2010 at 11:53 AM, Ralph Castain  wrote:


I'm afraid that having 2 cores on a single machine will always outperform 
having 1 core on each machine if any communication is involved.


The most likely thing that is happening is that OMPI is polling waiting for 
messages to arrive. You might look closer at your code to try and optimize it 
better so that number-crunching can get more attention.


Others on this list are far more knowledgeable than I am about doing such 
things, so I'll let them take it from here. Glad it is now running!



On Jul 13, 2010, at 12:22 PM, Robert Walters wrote:
OpenMPI,

Following up. The sysadmin opened ports for machine to machine com

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-13 Thread Robert Walters
I think I forgot to mention earlier that the application I am using is 
pre-compiled. It is a finite element software called LS-DYNA. It is not open 
source and I likely cannot obtain the code it uses for MPP. This version I am 
using was specifically compiled, by the parent company, for OpenMPI 1.4.2 MPP 
operations. 

I recently installed Sun Studio 12.1 to attempt to analyze the situation. 
It seems to work partially: it will record the various processes individually, 
which is cryptic to read. The function it fails on, though, is the MPI tracing. 
It errors that "no MPI tracing data file in experiment, MPI Timeline and MPI 
Charts will not be available". Sometime during the analysis (about 10,000 
iterations later), it complains that there have been too many I/O flushes 
(VT_MAX_FLUSHES) and it's not happy. I've increased that number via the 
environment variable and killed the analysis before it had a chance to error, 
but still no MPI trace data is recorded. Not sure if you guys have heard of 
that happening or know any way to fix it... Did OpenMPI need to be 
configured/built for Sun Studio use?

I also noticed that from the data I do get back, there are two sets of 
functions for everything. There is mpi_recv and then my_recv_, both with the 
same % utilization time. The mpi one comes from your program's library and the 
my_recv_ one comes from my program. Is that typical or should the program I'm 
using be saying mpi_recv only? This data may be enough to help me see what's 
wrong so I will pass it along. Keep in mind this is percent time of total run 
time and not percent of MPI communication. I attached the information in a 
picture rather than me attempting to format a nice table in this nasty e-mail 
application.
I blacked out items that are related to LS-DYNA but afterward I just realized 
that I think every function with an _ at the end represents a command issuing 
from LS-DYNA. 

These are my big spenders. The processes I did not include are in the bottom 
4%. The processes that would be above these were the LS-DYNA applications at 
100%. Like I mentioned earlier, there are two instances of every MPI command, 
and they carry the same percent usage. It's curious that this version, built 
for OpenMPI, uses different functions. 

Just for a little more background info, OpenMPI is being launched from a local 
hard drive on each machine, but the LS-DYNA job files, and related data output 
files, are on a mounted drive on that machine, where the mounted drive is 
located on a different machine also in the cluster. We were thinking that might 
be an issue but it isn't writing enough data for me to think that would 
significantly decrease MPP performance.

I would like to make one last mention. That is that OpenMPI running 8 cores on 
a single node, with all the communication, works flawlessly. It works much 
faster than the Shared Memory Parallel (SMP) version of LS-DYNA that we 
currently have used scaled to 8 cores. LS-DYNA seems to be approximately 25% 
faster (don't quote me on that) when using the OpenMPI installation than when 
using the standard SMP, which is awesome. My point being that OpenMPI seems to 
be working fine, even with the screwy mounted drive. This leads me to continue 
to point at the network.

Anyhow, let me know if anything seems weird on the OpenMPI communication 
subroutines. I don't have any numbers to lean on from experience.

Sorry this e-mail was long. Thank you again for all of your help.

Regards,
Robert Walters

--- On Tue, 7/13/10, David Zhang  wrote:

From: David Zhang 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 13, 2010, 9:42 PM

Like Ralph says, the slow down may not be coming from the kernel, but rather on 
waiting for messages.  What MPI send/recv commands are you using?

On Tue, Jul 13, 2010 at 11:53 AM, Ralph Castain  wrote:


I'm afraid that having 2 cores on a single machine will always outperform 
having 1 core on each machine if any communication is involved.


The most likely thing that is happening is that OMPI is polling waiting for 
messages to arrive. You might look closer at your code to try and optimize it 
better so that number-crunching can get more attention.


Others on this list are far more knowledgeable than I am about doing such 
things, so I'll let them take it from here. Glad it is now running!



On Jul 13, 2010, at 12:22 PM, Robert Walters wrote:
OpenMPI,

Following up. The sysadmin opened ports for machine to machine communication 
and OpenMPI is running successfully with no errors in connectivity_c, hello_c, 
or ring_c. Since then, I have started to implement our MPP software (finite element 
analysis) that we have, and upon running a simple, 1 core on machine1, 1 core 
on machine2, job, I notice it is considerably slower than a 2 core job on a 
single machine. 



A quick look at top shows me kerne

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-13 Thread David Zhang
Like Ralph says, the slowdown may not be coming from the kernel, but rather
from waiting for messages.  What MPI send/recv commands are you using?

On Tue, Jul 13, 2010 at 11:53 AM, Ralph Castain  wrote:

> I'm afraid that having 2 cores on a single machine will always outperform
> having 1 core on each machine if any communication is involved.
>
> The most likely thing that is happening is that OMPI is polling waiting for
> messages to arrive. You might look closer at your code to try and optimize
> it better so that number-crunching can get more attention.
>
> Others on this list are far more knowledgeable than I am about doing such
> things, so I'll let them take it from here. Glad it is now running!
>
>
> On Jul 13, 2010, at 12:22 PM, Robert Walters wrote:
>
> OpenMPI,
>
> Following up. The sysadmin opened ports for machine to machine
> communication and OpenMPI is running successfully with no errors in
> connectivity_c, hello_c, or ring_c. Since then, I have started to implement our
> MPP software (finite element analysis) that we have, and upon running a
> simple, 1 core on machine1, 1 core on machine2, job, I notice it is
> considerably slower than a 2 core job on a single machine.
>
> A quick look at top shows me kernel usage is almost twice what cpu usage
> is! On a 16 core job, (8 cores per node so 2 nodes total) test, OpenMPI was
> consuming ~65% of the cpu for kernel related items rather than
> number-crunching related items...Granted, we are running on GigE, but this
> is a finite element code we are running with no heavy data transfer within
> it. I'm looking into benchmarking tools, but my sysadmin is not very open to
> installing third-party software. Do you have any suggestions for what I can
> use that would be "big name" or guaranteed safe tools I can use to figure
> out what's causing the hold up with all the kernel usage? I'm pretty sure
> it's network traffic, but I have no way of telling (as far as I know, because
> I'm not a Linux whiz) with the standard tools in RHEL.
>
> Thanks for all the help! I'm glad to get it finally working and I think
> with a little tweaking it should be ready to go very soon.
>
> Regards,
> Robert Walters
> --- On Sat, 7/10/10, Ralph Castain  wrote:
>
>
> From: Ralph Castain 
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" 
> Date: Saturday, July 10, 2010, 4:37 PM
>
> The "static ports" flag means something different - it is used when the
> daemon is given a fixed port to use. In some installations, we lock every
> daemon to the same port number so that each daemon can compute exactly how
> to contact its peers (i.e., no contact info exchange required for wireup).
>
> You have a "fixed range", but not "static port", scenario. Hence the
> message.
>
> Let us know how it goes - I agree it sounds like something to discuss with
> the sysadmin.
>
>
> On Jul 10, 2010, at 1:47 PM, Robert Walters wrote:
>
> I ran oob_tcp_verbose 99 and I am getting something interesting I never got
> before.
>
> [machine 2:22347] bind() failed: no port available in the range
> [60001-60016]
> [machine 2:22347] mca_oob_tcp_init: unable to create IPv4 listen socket:
> Error
>
> I never got that error before we messed with the iptables but now I get
> that error... Very interesting, I will have to talk to my sysadmin again and
> make sure he opened the right ports on my two test machines. It looks as
> though there are no open ports. Another interesting thing is I see that the
> Daemon is still reporting:
>
> Daemon [[28845,0],1] checking in as pid 22347 on host machine 2
> Daemon [[28845,0],1] not using static ports
>
> Which, I may be misunderstanding, should have been taken care of when I
> specified what ports to use. I am telling it a static set of ports...
> Anyhow, I will get with my sysadmin again and see what he says. At least
> OpenMPI is correctly interpreting the range.
>
> Thanks for the help.
>
> --- On Sat, 7/10/10, Ralph Castain  wrote:
>
>
> From: Ralph Castain 
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" 
> Date: Saturday, July 10, 2010, 3:21 PM
>
> Are there multiple interfaces on your nodes? I'm wondering if we are using
> a different network than the one where you opened these ports.
>
> You'll get quite a bit of output, but you can turn on debug output in the
> oob itself with -mca oob_tcp_verbose xx. The higher the number, the more you
> get.
>
>
> On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:
>
> Hello again,
>
> I believe my administrator has opened the ports I requested. The problem 

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-13 Thread Ralph Castain
I'm afraid that having 2 cores on a single machine will always outperform 
having 1 core on each machine if any communication is involved.

The most likely thing that is happening is that OMPI is polling waiting for 
messages to arrive. You might look closer at your code to try and optimize it 
better so that number-crunching can get more attention.
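
A minimal sketch of the kind of overlap this advice points at, in illustrative C
(LS-DYNA itself is pre-compiled and cannot be changed, so this only shows the
general pattern; every name below is made up):

#include <mpi.h>

/* Post the receive early, do independent number crunching, and only then
 * block.  The time Open MPI spends polling then overlaps with useful work
 * instead of replacing it. */
void exchange_and_compute(double *halo, int n, int neighbor,
                          double *local, int m)
{
    MPI_Request req;

    MPI_Irecv(halo, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);

    for (int i = 0; i < m; i++)
        local[i] *= 2.0;               /* placeholder for real work */

    MPI_Wait(&req, MPI_STATUS_IGNORE); /* block only when nothing is left to overlap */
}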

Others on this list are far more knowledgeable than I am about doing such 
things, so I'll let them take it from here. Glad it is now running!


On Jul 13, 2010, at 12:22 PM, Robert Walters wrote:

> OpenMPI,
> 
> Following up. The sysadmin opened ports for machine to machine communication 
> and OpenMPI is running successfully with no errors in connectivity_c, 
> hello_c, or ring_c. Since then, I have started to implement our MPP software 
> (finite element analysis) that we have, and upon running a simple, 1 core on 
> machine1, 1 core on machine2, job, I notice it is considerably slower than a 
> 2 core job on a single machine. 
> 
> A quick look at top shows me kernel usage is almost twice what cpu usage is! 
> On a 16 core job, (8 cores per node so 2 nodes total) test, OpenMPI was 
> consuming ~65% of the cpu for kernel related items rather than 
> number-crunching related items...Granted, we are running on GigE, but this is 
> a finite element code we are running with no heavy data transfer within it. 
> I'm looking into benchmarking tools, but my sysadmin is not very open to 
> installing third-party software. Do you have any suggestions for what I can 
> use that would be "big name" or guaranteed safe tools I can use to figure out 
> what's causing the hold up with all the kernel usage? I'm pretty sure it's 
> network traffic but I have no way of telling (as far as I know because I'm 
> not a Linux whiz) with the standard tools in RHEL.
> 
> Thanks for all the help! I'm glad to get it finally working and I think with 
> a little tweaking it should be ready to go very soon.
> 
> Regards,
> Robert Walters
> --- On Sat, 7/10/10, Ralph Castain  wrote:
> 
> From: Ralph Castain 
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" 
> Date: Saturday, July 10, 2010, 4:37 PM
> 
> The "static ports" flag means something different - it is used when the 
> daemon is given a fixed port to use. In some installations, we lock every 
> daemon to the same port number so that each daemon can compute exactly how to 
> contact its peers (i.e., no contact info exchange required for wireup).
> 
> You have a "fixed range", but not "static port", scenario. Hence the message.
> 
> Let us know how it goes - I agree it sounds like something to discuss with 
> the sysadmin.
> 
> 
> On Jul 10, 2010, at 1:47 PM, Robert Walters wrote:
> 
>> I ran oob_tcp_verbose 99 and I am getting something interesting I never got 
>> before.
>> 
>> [machine 2:22347] bind() failed: no port available in the range [60001-60016]
>> [machine 2:22347] mca_oob_tcp_init: unable to create IPv4 listen socket: 
>> Error
>> 
>> I never got that error before we messed with the iptables but now I get that 
>> error... Very interesting, I will have to talk to my sysadmin again and make 
>> sure he opened the right ports on my two test machines. It looks as though 
>> there are no open ports. Another interesting thing is I see that the Daemon 
>> is still reporting:
>> 
>> Daemon [[28845,0],1] checking in as pid 22347 on host machine 2
>> Daemon [[28845,0],1] not using static ports
>> 
>> Which, I may be misunderstanding, should have been taken care of when I 
>> specified what ports to use. I am telling it a static set of ports... 
>> Anyhow, I will get with my sysadmin again and see what he says. At least 
>> OpenMPI is correctly interpreting the range. 
>> 
>> Thanks for the help.
>> 
>> --- On Sat, 7/10/10, Ralph Castain  wrote:
>> 
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>> To: "Open MPI Users" 
>> Date: Saturday, July 10, 2010, 3:21 PM
>> 
>> Are there multiple interfaces on your nodes? I'm wondering if we are using a 
>> different network than the one where you opened these ports.
>> 
>> You'll get quite a bit of output, but you can turn on debug output in the 
>> oob itself with -mca oob_tcp_verbose xx. The higher the number, the more you 
>> get.
>> 
>> 
>> On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:
>> 
>>> Hello again,
>>> 
>>> I believe my administrator has opened the ports I requested. The problem I 
>>> am having now is that OpenM

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-13 Thread Robert Walters
OpenMPI,

Following up. The sysadmin opened ports for machine-to-machine communication, 
and OpenMPI is running successfully with no errors in connectivity_c, hello_c, 
or ring_c. Since then, I have started to implement our MPP software (finite 
element analysis), and upon running a simple job (1 core on machine1, 1 core 
on machine2), I notice it is considerably slower than a 2 core job on a 
single machine. 

A quick look at top shows me kernel usage is almost twice what CPU usage is! On 
a 16 core test (8 cores per node, so 2 nodes total), OpenMPI was consuming 
~65% of the CPU on kernel-related items rather than number-crunching related 
items... Granted, we are running on GigE, but this is a finite element code we 
are running with no heavy data transfer within it. I'm looking into 
benchmarking tools, but my sysadmin is not very open to installing third party 
software. Do you have any suggestions for what I can use that would be "big 
name" or guaranteed safe tools I can use to figure out what's causing the hold 
up with all the kernel usage? I'm pretty sure it's network traffic, but I have no 
way of telling (as far as I know because I'm not a Linux whiz) with the 
standard tools in RHEL.

Thanks for all the help! I'm glad to get it finally working and I think with a 
little tweaking it should be ready to go very soon.

Regards,
Robert Walters
--- On Sat, 7/10/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Saturday, July 10, 2010, 4:37 PM

The "static ports" flag means something different - it is used when the daemon 
is given a fixed port to use. In some installations, we lock every daemon to 
the same port number so that each daemon can compute exactly how to contact its 
peers (i.e., no contact info exchange required for wireup).
You have a "fixed range", but not "static port", scenario. Hence the message.
Let us know how it goes - I agree it sounds like something to discuss with the 
sysadmin.

On Jul 10, 2010, at 1:47 PM, Robert Walters wrote:
I ran oob_tcp_verbose 99 and I am getting something interesting I never got 
before.

[machine 2:22347] bind() failed: no port available in the range [60001-60016]
[machine 2:22347] mca_oob_tcp_init: unable to create IPv4 listen socket: Error

I never got that error before we messed with the iptables but now I get that 
error... Very interesting, I will have to talk to my sysadmin again and make 
sure he opened the right ports on my two test machines. It looks as though 
there are no open ports. Another interesting thing is I see that the Daemon is 
still reporting:

Daemon [[28845,0],1] checking in as pid 22347 on host machine 2
Daemon [[28845,0],1] not using static ports

Which, I may be misunderstanding, should have been taken care of when I 
specified what ports to use. I am telling it a static set of ports... Anyhow, I 
will get with my
 sysadmin again and see what he says. At least OpenMPI is correctly 
interpreting the range. 

Thanks for the help.

--- On Sat, 7/10/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Saturday, July 10, 2010, 3:21 PM

Are there multiple interfaces on your nodes? I'm wondering if we are using a 
different network than the one where you opened these ports.
You'll get quite a bit of output, but you can turn on debug output in the oob 
itself with -mca oob_tcp_verbose xx. The higher the number, the more you get.

On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:
Hello again,

I believe my administrator has opened the ports I requested. The problem I am 
having now is that OpenMPI is not listening to my defined port assignments in 
openmpi-mca-params.conf (looks like permission 644 on those files should it be 
755?)

When I perform netstat -ltnup I see that orted is listening 14 processes in tcp 
but scattered in the 26000-ish port range when I specified 60001-60016 in the 
mca-params file. Is there a parameter I am missing? In any case I am still 
hanging as mentioned originally even with the port forwarding enabled and 
specifications in mca-param enabled. 

Any other ideas on what might be causing the hang? Is there a more verbose mode 
I can employ to see more deeply into the issue? I have run --debug-daemons and 
--mca plm_base_verbose
 99.

Thanks!
--- On Tue, 7/6/10, Robert Walters
  wrote:

From: Robert Walters 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 5:41 PM

Thanks for your expeditious responses, Ralph.

Just to confirm with you, I should change openmpi-mca-params.conf to
 include:

oob_tcp_port_min_v4 = (My minimum port in the range)
oob_tcp_port_range_v4 = (My port range)
btl_tcp_

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-10 Thread Ralph Castain
The "static ports" flag means something different - it is used when the daemon 
is given a fixed port to use. In some installations, we lock every daemon to 
the same port number so that each daemon can compute exactly how to contact its 
peers (i.e., no contact info exchange required for wireup).

You have a "fixed range", but not "static port", scenario. Hence the message.

Let us know how it goes - I agree it sounds like something to discuss with the 
sysadmin.


On Jul 10, 2010, at 1:47 PM, Robert Walters wrote:

> I ran oob_tcp_verbose 99 and I am getting something interesting I never got 
> before.
> 
> [machine 2:22347] bind() failed: no port available in the range [60001-60016]
> [machine 2:22347] mca_oob_tcp_init: unable to create IPv4 listen socket: Error
> 
> I never got that error before we messed with the iptables but now I get that 
> error... Very interesting, I will have to talk to my sysadmin again and make 
> sure he opened the right ports on my two test machines. It looks as though 
> there are no open ports. Another interesting thing is I see that the Daemon 
> is still reporting:
> 
> Daemon [[28845,0],1] checking in as pid 22347 on host machine 2
> Daemon [[28845,0],1] not using static ports
> 
> Which, I may be misunderstanding, should have been taken care of when I 
> specified what ports to use. I am telling it a static set of ports... Anyhow, 
> I will get with my sysadmin again and see what he says. At least OpenMPI is 
> correctly interpreting the range. 
> 
> Thanks for the help.
> 
> --- On Sat, 7/10/10, Ralph Castain  wrote:
> 
> From: Ralph Castain 
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" 
> Date: Saturday, July 10, 2010, 3:21 PM
> 
> Are there multiple interfaces on your nodes? I'm wondering if we are using a 
> different network than the one where you opened these ports.
> 
> You'll get quite a bit of output, but you can turn on debug output in the oob 
> itself with -mca oob_tcp_verbose xx. The higher the number, the more you get.
> 
> 
> On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:
> 
>> Hello again,
>> 
>> I believe my administrator has opened the ports I requested. The problem I 
>> am having now is that OpenMPI is not listening to my defined port 
>> assignments in openmpi-mca-params.conf (looks like permission 644 on those 
>> files should it be 755?)
>> 
>> When I perform netstat -ltnup I see that orted is listening 14 processes in 
>> tcp but scattered in the 26000-ish port range when I specified 60001-60016 in 
>> the mca-params file. Is there a parameter I am missing? In any case I am 
>> still hanging as mentioned originally even with the port forwarding enabled 
>> and specifications in mca-param enabled. 
>> 
>> Any other ideas on what might be causing the hang? Is there a more verbose 
>> mode I can employ to see more deeply into the issue? I have run 
>> --debug-daemons and --mca plm_base_verbose 99.
>> 
>> Thanks!
>> --- On Tue, 7/6/10, Robert Walters  wrote:
>> 
>> From: Robert Walters 
>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>> To: "Open MPI Users" 
>> Date: Tuesday, July 6, 2010, 5:41 PM
>> 
>> Thanks for your expeditious responses, Ralph.
>> 
>> Just to confirm with you, I should change openmpi-mca-params.conf to include:
>> 
>> oob_tcp_port_min_v4 = (My minimum port in the range)
>> oob_tcp_port_range_v4 = (My port range)
>> btl_tcp_port_min_v4 = (My minimum port in the range)
>> btl_tcp_port_range_v4 = (My port range)
>> 
>> correct?
>> 
>> Also, for a cluster of around 32-64 processes (8 processors per node), how 
>> wide of a range will I require? I've noticed some entries in the mailing 
>> list suggesting you need a few to get started and then it opens as 
>> necessary. Will I be safe with 20 or should I go for 100? 
>> 
>> Thanks again for all of your help!
>> 
>> --- On Tue, 7/6/10, Ralph Castain  wrote:
>> 
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>> To: "Open MPI Users" 
>> Date: Tuesday, July 6, 2010, 5:31 PM
>> 
>> Problem isn't with ssh - the problem is that the daemons need to open a TCP 
>> connection back to the machine where mpirun is running. If the firewall 
>> blocks that connection, then we can't run.
>> 
>> If you can get a range of ports opened, then you can specify the ports OMPI 
>> should use for this purpose. If the sysadmin won't allow even that, then you 
>> are pretty well hosed.
>> 

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-10 Thread Robert Walters
I ran oob_tcp_verbose 99 and I am getting something interesting I never got 
before.

[machine 2:22347] bind() failed: no port available in the range [60001-60016]
[machine 2:22347] mca_oob_tcp_init: unable to create IPv4 listen socket: Error

I never got that error before we messed with the iptables but now I get that 
error... Very interesting, I will have to talk to my sysadmin again and make 
sure he opened the right ports on my two test machines. It looks as though 
there are no open ports. Another interesting thing is I see that the Daemon is 
still reporting:

Daemon [[28845,0],1] checking in as pid 22347 on host machine 2
Daemon [[28845,0],1] not using static ports

Which, I may be misunderstanding, should have been taken care of when I 
specified what ports to use. I am telling it a static set of ports... Anyhow, I 
will get with my sysadmin again and see what he says. At least OpenMPI is 
correctly interpreting the range. 

Thanks for the help.
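
The bind() call that fails above is a purely local operation (iptables rules
normally affect incoming connections, not bind), so a quick sanity check
independent of Open MPI is to try binding to the same range directly on the
node. A minimal sketch, assuming the 60001-60016 range from the error message:

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Attempt to bind a TCP socket to every port in the reported range and
 * print which ones are actually available on this node. */
int main(void)
{
    for (int port = 60001; port <= 60016; port++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            printf("port %d: bind OK\n", port);
        else
            printf("port %d: bind failed (%s)\n", port, strerror(errno));
        close(fd);
    }
    return 0;
}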

--- On Sat, 7/10/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Saturday, July 10, 2010, 3:21 PM

Are there multiple interfaces on your nodes? I'm wondering if we are using a 
different network than the one where you opened these ports.
You'll get quite a bit of output, but you can turn on debug output in the oob 
itself with -mca oob_tcp_verbose xx. The higher the number, the more you get.

On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:
Hello again,

I believe my administrator has opened the ports I requested. The problem I am 
having now is that OpenMPI is not listening to my defined port assignments in 
openmpi-mca-params.conf (looks like permission 644 on those files should it be 
755?)

When I perform netstat -ltnup I see that orted is listening 14 processes in tcp 
but scattered in the 26000-ish port range when I specified 60001-60016 in the 
mca-params file. Is there a parameter I am missing? In any case I am still 
hanging as mentioned originally even with the port forwarding enabled and 
specifications in mca-param enabled. 

Any other ideas on what might be causing the hang? Is there a more verbose mode 
I can employ to see more deeply into the issue? I have run --debug-daemons and 
--mca plm_base_verbose 99.

Thanks!
--- On Tue, 7/6/10, Robert Walters
  wrote:

From: Robert Walters 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 5:41 PM

Thanks for your expeditious responses, Ralph.

Just to confirm with you, I should change openmpi-mca-params.conf to include:

oob_tcp_port_min_v4 = (My minimum port in the range)
oob_tcp_port_range_v4 = (My port range)
btl_tcp_port_min_v4 = (My minimum port in the range)
btl_tcp_port_range_v4 = (My port range)

correct?

Also, for a cluster of around 32-64 processes (8 processors per node), how wide 
of a range will I require? I've noticed some entries in
 the mailing list suggesting you need a few to get started and then it opens as 
necessary. Will I be safe with 20 or should I go for 100? 

Thanks again for all of your help!

--- On Tue, 7/6/10, Ralph Castain  wrote:

From:
 Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 5:31 PM

Problem isn't with ssh - the problem is that the daemons need to open a TCP 
connection back to the machine where mpirun is running. If the firewall blocks 
that connection, then we can't run.
If you can get a range of ports opened, then you can specify the ports OMPI 
should use for this purpose. If the sysadmin won't allow even that, then you 
are pretty well hosed.

On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
Yes, there is a system firewall. I don't think the sysadmin will allow it to go 
disabled. Each Linux machine
 has the built-in RHEL firewall. SSH is enabled through the firewall though.

--- On Tue, 7/6/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 4:19 PM

It looks like the remote daemon is starting - is there a firewall in the way?
On Jul 6, 2010, at 2:04 PM, Robert Walters
 wrote:
Hello all,

I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opteron's and right 
now I am just working on getting OpenMPI itself up and running. I have a 
successful configure and make all install. LD_LIBRARY_PATH and PATH variables 
were correctly edited. mpirun -np 8 hello_c successfully works on all machines. 
I have setup my two test machines with DSA key pairs that successfully work 
with each other.

The problem comes when I initiate my hostfile to attempt to communicate across 
machines. The hostfile is setup correctly with   . 
When running with all 

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-10 Thread Ralph Castain
Are there multiple interfaces on your nodes? I'm wondering if we are using a 
different network than the one where you opened these ports.

You'll get quite a bit of output, but you can turn on debug output in the oob 
itself with -mca oob_tcp_verbose xx. The higher the number, the more you get.


On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:

> Hello again,
> 
> I believe my administrator has opened the ports I requested. The problem I am 
> having now is that OpenMPI is not listening to my defined port assignments in 
> openmpi-mca-params.conf (looks like permission 644 on those files should it 
> be 755?)
> 
> When I perform netstat -ltnup I see that orted is listening 14 processes in 
> tcp but scattered in the 26000-ish port range when I specified 60001-60016 in 
> the mca-params file. Is there a parameter I am missing? In any case I am 
> still hanging as mentioned originally even with the port forwarding enabled 
> and specifications in mca-param enabled. 
> 
> Any other ideas on what might be causing the hang? Is there a more verbose 
> mode I can employ to see more deeply into the issue? I have run 
> --debug-daemons and --mca plm_base_verbose 99.
> 
> Thanks!
> --- On Tue, 7/6/10, Robert Walters  wrote:
> 
> From: Robert Walters 
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" 
> Date: Tuesday, July 6, 2010, 5:41 PM
> 
> Thanks for your expeditious responses, Ralph.
> 
> Just to confirm with you, I should change openmpi-mca-params.conf to include:
> 
> oob_tcp_port_min_v4 = (My minimum port in the range)
> oob_tcp_port_range_v4 = (My port range)
> btl_tcp_port_min_v4 = (My minimum port in the range)
> btl_tcp_port_range_v4 = (My port range)
> 
> correct?
> 
> Also, for a cluster of around 32-64 processes (8 processors per node), how 
> wide of a range will I require? I've noticed some entries in the mailing list 
> suggesting you need a few to get started and then it opens as necessary. Will 
> I be safe with 20 or should I go for 100? 
> 
> Thanks again for all of your help!
> 
> --- On Tue, 7/6/10, Ralph Castain  wrote:
> 
> From: Ralph Castain 
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" 
> Date: Tuesday, July 6, 2010, 5:31 PM
> 
> Problem isn't with ssh - the problem is that the daemons need to open a TCP 
> connection back to the machine where mpirun is running. If the firewall 
> blocks that connection, then we can't run.
> 
> If you can get a range of ports opened, then you can specify the ports OMPI 
> should use for this purpose. If the sysadmin won't allow even that, then you 
> are pretty well hosed.
> 
> 
> On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
> 
>> Yes, there is a system firewall. I don't think the sysadmin will allow it to 
>> go disabled. Each Linux machine has the built-in RHEL firewall. SSH is 
>> enabled through the firewall though.
>> 
>> --- On Tue, 7/6/10, Ralph Castain  wrote:
>> 
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>> To: "Open MPI Users" 
>> Date: Tuesday, July 6, 2010, 4:19 PM
>> 
>> It looks like the remote daemon is starting - is there a firewall in the way?
>> 
>> On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
>> 
>>> Hello all,
>>> 
>>> I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opteron's and 
>>> right now I am just working on getting OpenMPI itself up and running. I 
>>> have a successful configure and make all install. LD_LIBRARY_PATH and PATH 
>>> variables were correctly edited. mpirun -np 8 hello_c successfully works on 
>>> all machines. I have setup my two test machines with DSA key pairs that 
>>> successfully work with each other.
>>> 
>>> The problem comes when I initiate my hostfile to attempt to communicate 
>>> across machines. The hostfile is setup correctly with   
>>> . When running with all verbose options enabled "mpirun --mca 
>>> plm_base_verbose 99 --debug-daemons --mca btl_base_verbose 30 --mca 
>>> oob_base_verbose 99 --mca pml_base_verbose 99 -hostfile hostfile -np 16 
>>> hello_c" I receive the following text output.
>>> 
>>> [machine1:03578] mca: base: components_open: Looking for plm components
>>> [machine1:03578] mca: base: components_open: opening plm components
>>> [machine1:03578] mca: base: components_open: found loaded component rsh
>>> [machine1:03578] mca: base: components_open: component rsh has no register 
>>> function
>>> [machine1:03578] mca: base: com

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-10 Thread Robert Walters
Hello again,

I believe my administrator has opened the ports I requested. The problem I am 
having now is that OpenMPI is not honoring my defined port assignments in 
openmpi-mca-params.conf (it looks like those files have permission 644; should 
it be 755?)

When I perform netstat -ltnup I see that orted is listening 14 processes in tcp 
but scattered in the 26000-ish port range when I specified 60001-60016 in the 
mca-params file. Is there a parameter I am missing? In any case I am still 
hanging as mentioned originally even with the port forwarding enabled and 
specifications in mca-param enabled. 

Any other ideas on what might be causing the hang? Is there a more verbose mode 
I can employ to see more deeply into the issue? I have run --debug-daemons and 
--mca plm_base_verbose 99.

Thanks!
--- On Tue, 7/6/10, Robert Walters  wrote:

From: Robert Walters 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 5:41 PM

Thanks for your expeditious responses, Ralph.

Just to confirm with you, I should change openmpi-mca-params.conf to include:

oob_tcp_port_min_v4 = (My minimum port in the range)
oob_tcp_port_range_v4 = (My port range)
btl_tcp_port_min_v4 = (My minimum port in the range)
btl_tcp_port_range_v4 = (My port range)

correct?

Also, for a cluster of around 32-64 processes (8 processors per node), how wide 
of a range will I require? I've noticed some entries in the mailing list 
suggesting you need a few to get started and then it opens as necessary. Will I 
be safe with 20 or should I go for 100? 

Thanks again for all of your help!

--- On Tue, 7/6/10, Ralph Castain  wrote:

From:
 Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 5:31 PM

Problem isn't with ssh - the problem is that the daemons need to open a TCP 
connection back to the machine where mpirun is running. If the firewall blocks 
that connection, then we can't run.
If you can get a range of ports opened, then you can specify the ports OMPI 
should use for this purpose. If the sysadmin won't allow even that, then you 
are pretty well hosed.

On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
Yes, there is a system firewall. I don't think the sysadmin will allow it to go 
disabled. Each Linux machine
 has the built-in RHEL firewall. SSH is enabled through the firewall though.

--- On Tue, 7/6/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 4:19 PM

It looks like the remote daemon is starting - is there a firewall in the way?
On Jul 6, 2010, at 2:04 PM, Robert Walters
 wrote:
Hello all,

I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opteron's and right 
now I am just working on getting OpenMPI itself up and running. I have a 
successful configure and make all install. LD_LIBRARY_PATH and PATH variables 
were correctly edited. mpirun -np 8 hello_c successfully works on all machines. 
I have setup my two test machines with DSA key pairs that successfully work 
with each other.

The problem comes when I initiate my hostfile to attempt to communicate across 
machines. The hostfile is setup correctly with   . 
When running with all verbose options enabled "mpirun --mca plm_base_verbose 99 
--debug-daemons --mca btl_base_verbose 30 --mca oob_base_verbose 99 --mca
 pml_base_verbose 99 -hostfile hostfile -np 16 hello_c" I receive the following 
text output.

[machine1:03578] mca: base: components_open: Looking for plm components
[machine1:03578] mca: base: components_open: opening plm components
[machine1:03578] mca: base: components_open: found loaded component rsh
[machine1:03578] mca: base: components_open: component rsh has no register 
function
[machine1:03578] mca: base: components_open: component rsh open function 
successful
[machine1:03578] mca: base: components_open: found loaded component slurm
[machine1:03578] mca: base: components_open: component slurm has no register 
function
[machine1:03578] mca: base: components_open: component slurm open function 
successful
[machine1:03578] mca:base:select: Auto-selecting plm components
[machine1:03578] mca:base:select:(  plm) Querying component [rsh]
[machine1:03578] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[machine1:03578] mca:base:select:(  plm) Querying component
 [slurm]
[machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[machine1:03578] mca:base:select:(  plm) Selected component [rsh]
[machine1:03578] mca: base: close: component slurm closed
[machine1:03578] mca: base: close: unloading component slurm
[machine1:03578] mca: base: components_open: Looking for oob components
[machine1:03578] mca: ba

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-07 Thread Jeff Squyres
On Jul 6, 2010, at 6:36 PM, Reuti wrote:

> But just out of curiosity: at some point Open MPI chooses the ports. At 
> that point it might be possible to start two SSH tunnels per 
> slave node to cover both directions, and the daemons would then have to 
> contact "localhost" on a specific port which is tunneled to each 
> slave. In principle it should work, I think, but it's just not 
> implemented for now.

Agreed.  Patches would be welcome!  :-)

> Maybe it could be an addition to Open MPI for security-concerned 
> usage. I wonder about the speed impact when compression is switched 
> on in SSH in such a setup, in case you transfer large amounts of 
> data via Open MPI.

For control data (i.e., control messages passed during MPI startup, shutdown, 
etc.), the impact may not matter much.  For MPI data (i.e., having something 
like an "ssh" BTL), I could imagine quite a bit of slowdown.  But then again, 
it depends on what your goals are -- if your local policies demand ssh or 
nothing, having "slow" MPI might be better than nothing.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-06 Thread Reuti

On 06.07.2010, at 23:31, Ralph Castain wrote:

Problem isn't with ssh - the problem is that the daemons need to  
open a TCP connection back to the machine where mpirun is running.  
If the firewall blocks that connection, then we can't run.


If you can get a range of ports opened, then you can specify the  
ports OMPI should use for this purpose. If the sysadmin won't allow  
even that, then you are pretty well hosed.


Yes, often MPI takes place inside a cluster which is on a private  
subnet anyway, hence there are no security impacts at all. I have no  
firewalls on my cluster nodes (only on the headnode), as they are not  
connected to the outside world.


But just out of curiosity: at some point Open MPI chooses the ports. At 
that point it might be possible to start two SSH tunnels per 
slave node to cover both directions, and the daemons would then have to 
contact "localhost" on a specific port which is tunneled to each 
slave. In principle it should work, I think, but it's just not 
implemented for now.


Maybe it could be an addition to Open MPI for security-concerned 
usage. I wonder about the speed impact when compression is switched 
on in SSH in such a setup, in case you transfer large amounts of 
data via Open MPI.


-- Reuti



On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:

Yes, there is a system firewall. I don't think the sysadmin will  
allow it to go disabled. Each Linux machine has the built-in RHEL  
firewall. SSH is enabled through the firewall though.


--- On Tue, 7/6/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
Date: Tuesday, July 6, 2010, 4:19 PM

It looks like the remote daemon is starting - is there a firewall  
in the way?


On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:


Hello all,

I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD  
Opteron's and right now I am just working on getting OpenMPI  
itself up and running. I have a successful configure and make all  
install. LD_LIBRARY_PATH and PATH variables were correctly edited.  
mpirun -np 8 hello_c successfully works on all machines. I have  
setup my two test machines with DSA key pairs that successfully  
work with each other.


The problem comes when I initiate my hostfile to attempt to  
communicate across machines. The hostfile is setup correctly with  
  . When running with all verbose  
options enabled "mpirun --mca plm_base_verbose 99 --debug-daemons  
--mca btl_base_verbose 30 --mca oob_base_verbose 99 --mca  
pml_base_verbose 99 -hostfile hostfile -np 16 hello_c" I receive  
the following text output.


[machine1:03578] mca: base: components_open: Looking for plm  
components

[machine1:03578] mca: base: components_open: opening plm components
[machine1:03578] mca: base: components_open: found loaded  
component rsh
[machine1:03578] mca: base: components_open: component rsh has no  
register function
[machine1:03578] mca: base: components_open: component rsh open  
function successful
[machine1:03578] mca: base: components_open: found loaded  
component slurm
[machine1:03578] mca: base: components_open: component slurm has  
no register function
[machine1:03578] mca: base: components_open: component slurm open  
function successful

[machine1:03578] mca:base:select: Auto-selecting plm components
[machine1:03578] mca:base:select:(  plm) Querying component [rsh]
[machine1:03578] mca:base:select:(  plm) Query of component [rsh]  
set priority to 10

[machine1:03578] mca:base:select:(  plm) Querying component [slurm]
[machine1:03578] mca:base:select:(  plm) Skipping component  
[slurm]. Query failed to return a module

[machine1:03578] mca:base:select:(  plm) Selected component [rsh]
[machine1:03578] mca: base: close: component slurm closed
[machine1:03578] mca: base: close: unloading component slurm
[machine1:03578] mca: base: components_open: Looking for oob  
components

[machine1:03578] mca: base: components_open: opening oob components
[machine1:03578] mca: base: components_open: found loaded  
component tcp
[machine1:03578] mca: base: components_open: component tcp has no  
register function
[machine1:03578] mca: base: components_open: component tcp open  
function successful

Daemon was launched on machine2- beginning to initialize
[machine2:01962] mca: base: components_open: Looking for oob  
components

[machine2:01962] mca: base: components_open: opening oob components
[machine2:01962] mca: base: components_open: found loaded  
component tcp
[machine2:01962] mca: base: components_open: component tcp has no  
register function
[machine2:01962] mca: base: components_open: component tcp open  
function successful

Daemon [[1418,0],1] checking in as pid 1962 on host machine2
Daemon [[1418,0],1] not using static ports

At this point the system hangs indefinitely. While running top on  
the machine2 terminal, I see several things come up b

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-06 Thread Jeff Squyres
On Jul 6, 2010, at 5:41 PM, Robert Walters wrote:

> Thanks for your expeditious responses, Ralph.
> 
> Just to confirm with you, I should change openmpi-mca-params.conf to include:
> 
> oob_tcp_port_min_v4 = (My minimum port in the range)
> oob_tcp_port_range_v4 = (My port range)
> btl_tcp_port_min_v4 = (My minimum port in the range)
> btl_tcp_port_range_v4 = (My port range)
> 
> correct?

That should do ya.  Use the same values on all nodes.  You should be able to 
confirm that OMPI's run-time system is working if you are able to mpirun a 
non-MPI program like "hostname" or somesuch.  If that works, then the daemons 
are launching, talking to each other, launching the app, shuttling the I/O 
around, noticing that the app is dying, tidying everything up, and telling 
mpirun that everything is done.  In short: lots of things are happening right 
if you're able to mpirun "hostname" across multiple hosts.

> Also, for a cluster of around 32-64 processes (8 processors per node), how 
> wide of a range will I require? I've noticed some entries in the mailing list 
> suggesting you need a few to get started and then it opens as necessary. Will 
> I be safe with 20 or should I go for 100? 

If you have 64 hosts, each with 8 processors, meaning that the largest MPI job 
you would run would be 64 * 8 = 512 MPI processes, then I'd ask for at least 
1024 -- 2048 would be better (you have a zillion ports; better to ask for more 
than you need).  We recently found a bug in the TCP BTL where it *may* use 2 
sockets for each peerwise connection in some cases.

Additionally, your sysadmin *might* be more amenable to opening up ports *only 
between the cluster nodes* (vs. opening up the ports to anything).  If that's 
the case, you might as well go for the gold and ask them if they can open up 
*all* the ports between all your nodes (while still rejecting everything from 
non-cluster nodes).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-06 Thread Robert Walters
Thanks for your expeditious responses, Ralph.

Just to confirm with you, I should change openmpi-mca-params.conf to include:

oob_tcp_port_min_v4 = (My minimum port in the range)
oob_tcp_port_range_v4 = (My port range)
btl_tcp_port_min_v4 = (My minimum port in the range)
btl_tcp_port_range_v4 = (My port range)

correct?

Also, for a cluster of around 32-64 processes (8 processors per node), how wide 
of a range will I require? I've noticed some entries in the mailing list 
suggesting you need a few to get started and then it opens as necessary. Will I 
be safe with 20 or should I go for 100? 

Thanks again for all of your help!

--- On Tue, 7/6/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 5:31 PM

Problem isn't with ssh - the problem is that the daemons need to open a TCP 
connection back to the machine where mpirun is running. If the firewall blocks 
that connection, then we can't run.
If you can get a range of ports opened, then you can specify the ports OMPI 
should use for this purpose. If the sysadmin won't allow even that, then you 
are pretty well hosed.

On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
Yes, there is a system firewall. I don't think the sysadmin will allow it to be 
disabled. Each Linux machine has the built-in RHEL firewall. SSH is enabled 
through the firewall though.

--- On Tue, 7/6/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 4:19 PM

It looks like the remote daemon is starting - is there a firewall in the way?
On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
Hello all,

I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons and right 
now I am just working on getting OpenMPI itself up and running. I have a 
successful configure and make all install. LD_LIBRARY_PATH and PATH variables 
were correctly edited. mpirun -np 8 hello_c successfully works on all machines. 
I have set up my two test machines with DSA key pairs that successfully work 
with each other.
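
(For reference, passwordless DSA key pairs of the sort described here are usually 
created along these lines; the user and host names are placeholders:

    ssh-keygen -t dsa              # accept the defaults, empty passphrase
    ssh-copy-id myuser@machine2    # or append ~/.ssh/id_dsa.pub to machine2's authorized_keys
    ssh machine2 hostname          # should return with no password prompt

Since the remote orted daemons are launched over ssh, every node needs to be 
reachable this way from the node running mpirun.)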

The problem comes when I initiate my hostfile to attempt to communicate across 
machines. The hostfile is set up correctly with   . 
When running with all verbose options enabled "mpirun --mca plm_base_verbose 99 
--debug-daemons --mca btl_base_verbose 30 --mca oob_base_verbose 99 --mca 
pml_base_verbose 99 -hostfile hostfile -np 16 hello_c" I receive the following 
text output.
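
(The angle-bracketed hostfile template above was eaten by the mail archive; as an 
illustration only, a two-node hostfile for a setup like this one would typically read:

    machine1 slots=8
    machine2 slots=8

i.e. one line per host with its slot count, which is what lets "-np 16" spread 
across the two 8-core machines.)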

[machine1:03578] mca: base: components_open: Looking for plm components
[machine1:03578] mca: base: components_open: opening plm components
[machine1:03578] mca: base: components_open: found loaded component rsh
[machine1:03578] mca: base: components_open: component rsh has no register 
function
[machine1:03578] mca: base: components_open: component rsh open function 
successful
[machine1:03578] mca: base: components_open: found loaded component slurm
[machine1:03578] mca: base: components_open: component slurm has no register 
function
[machine1:03578] mca: base: components_open: component slurm open function 
successful
[machine1:03578] mca:base:select: Auto-selecting plm components
[machine1:03578] mca:base:select:(  plm) Querying component [rsh]
[machine1:03578] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[machine1:03578] mca:base:select:(  plm) Querying component [slurm]
[machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[machine1:03578] mca:base:select:(  plm) Selected component [rsh]
[machine1:03578] mca: base: close: component slurm closed
[machine1:03578] mca: base: close: unloading component slurm
[machine1:03578] mca: base: components_open: Looking for oob components
[machine1:03578] mca: base: components_open: opening oob components
[machine1:03578] mca: base: components_open: found loaded component tcp
[machine1:03578] mca: base: components_open: component tcp has no register 
function
[machine1:03578] mca: base: components_open: component tcp open function 
successful
Daemon was launched on machine2- beginning to initialize
[machine2:01962] mca: base: components_open: Looking for oob components
[machine2:01962] mca: base: components_open: opening oob components
[machine2:01962] mca: base: components_open: found loaded component tcp
[machine2:01962] mca: base: components_open: component tcp has no register 
function
[machine2:01962] mca: base: components_open: component tcp open function 
successful
Daemon [[1418,0],1] checking in as pid 1962 on host machine2
Daemon [[1418,0],1] not using static ports

At this point the system hangs indefinitely. While running top on the machine2 
terminal, I see several things come up briefly. These items are: sshd (root), 
tcsh (myuser), orted (myuser), and mcstransd (root).

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-06 Thread Ralph Castain
Problem isn't with ssh - the problem is that the daemons need to open a TCP 
connection back to the machine where mpirun is running. If the firewall blocks 
that connection, then we can't run.

If you can get a range of ports opened, then you can specify the ports OMPI 
should use for this purpose. If the sysadmin won't allow even that, then you 
are pretty well hosed.
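
As a sketch of what "specify the ports" can look like without touching any config 
file -- the parameter names are the ones discussed elsewhere in this thread, and the 
values are placeholders -- the same restriction can be passed straight to mpirun:

    mpirun --mca oob_tcp_port_min_v4 46000 --mca oob_tcp_port_range_v4 2048 \
           --mca btl_tcp_port_min_v4 46000 --mca btl_tcp_port_range_v4 2048 \
           -hostfile hostfile -np 16 hello_c

Command-line --mca settings take precedence over openmpi-mca-params.conf, which 
makes them handy for testing a port window before committing it to the file.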


On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:

> Yes, there is a system firewall. I don't think the sysadmin will allow it to 
> be disabled. Each Linux machine has the built-in RHEL firewall. SSH is 
> enabled through the firewall though.
> 
> --- On Tue, 7/6/10, Ralph Castain  wrote:
> 
> From: Ralph Castain 
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" 
> Date: Tuesday, July 6, 2010, 4:19 PM
> 
> It looks like the remote daemon is starting - is there a firewall in the way?
> 
> On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
> 
>> Hello all,
>> 
>> I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons and 
>> right now I am just working on getting OpenMPI itself up and running. I have 
>> a successful configure and make all install. LD_LIBRARY_PATH and PATH 
>> variables were correctly edited. mpirun -np 8 hello_c successfully works on 
>> all machines. I have set up my two test machines with DSA key pairs that 
>> successfully work with each other.
>> 
>> The problem comes when I initiate my hostfile to attempt to communicate 
>> across machines. The hostfile is set up correctly with   
>> . When running with all verbose options enabled "mpirun --mca 
>> plm_base_verbose 99 --debug-daemons --mca btl_base_verbose 30 --mca 
>> oob_base_verbose 99 --mca pml_base_verbose 99 -hostfile hostfile -np 16 
>> hello_c" I receive the following text output.
>> 
>> [machine1:03578] mca: base: components_open: Looking for plm components
>> [machine1:03578] mca: base: components_open: opening plm components
>> [machine1:03578] mca: base: components_open: found loaded component rsh
>> [machine1:03578] mca: base: components_open: component rsh has no register 
>> function
>> [machine1:03578] mca: base: components_open: component rsh open function 
>> successful
>> [machine1:03578] mca: base: components_open: found loaded component slurm
>> [machine1:03578] mca: base: components_open: component slurm has no register 
>> function
>> [machine1:03578] mca: base: components_open: component slurm open function 
>> successful
>> [machine1:03578] mca:base:select: Auto-selecting plm components
>> [machine1:03578] mca:base:select:(  plm) Querying component [rsh]
>> [machine1:03578] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [machine1:03578] mca:base:select:(  plm) Querying component [slurm]
>> [machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
>> failed to return a module
>> [machine1:03578] mca:base:select:(  plm) Selected component [rsh]
>> [machine1:03578] mca: base: close: component slurm closed
>> [machine1:03578] mca: base: close: unloading component slurm
>> [machine1:03578] mca: base: components_open: Looking for oob components
>> [machine1:03578] mca: base: components_open: opening oob components
>> [machine1:03578] mca: base: components_open: found loaded component tcp
>> [machine1:03578] mca: base: components_open: component tcp has no register 
>> function
>> [machine1:03578] mca: base: components_open: component tcp open function 
>> successful
>> Daemon was launched on machine2- beginning to initialize
>> [machine2:01962] mca: base: components_open: Looking for oob components
>> [machine2:01962] mca: base: components_open: opening oob components
>> [machine2:01962] mca: base: components_open: found loaded component tcp
>> [machine2:01962] mca: base: components_open: component tcp has no register 
>> function
>> [machine2:01962] mca: base: components_open: component tcp open function 
>> successful
>> Daemon [[1418,0],1] checking in as pid 1962 on host machine2
>> Daemon [[1418,0],1] not using static ports
>> 
>> At this point the system hangs indefinitely. While running top on the 
>> machine2 terminal, I see several things come up briefly. These items are: 
>> sshd (root), tcsh (myuser), orted (myuser), and mcstransd (root). I was 
>> wondering if sshd needs to be initiated by myuser? It is currently turned 
>> off in sshd_config through UsePAM yes. This was setup by the sysadmin but it 
>> can be worked around if this is necessary.

Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-06 Thread Robert Walters
Yes, there is a system firewall. I don't think the sysadmin will allow it to be 
disabled. Each Linux machine has the built-in RHEL firewall. SSH is enabled 
through the firewall though.

--- On Tue, 7/6/10, Ralph Castain  wrote:

From: Ralph Castain 
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 6, 2010, 4:19 PM

It looks like the remote daemon is starting - is there a firewall in the way?
On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
Hello all,

I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons and right 
now I am just working on getting OpenMPI itself up and running. I have a 
successful configure and make all install. LD_LIBRARY_PATH and PATH variables 
were correctly edited. mpirun -np 8 hello_c successfully works on all machines. 
I have set up my two test machines with DSA key pairs that successfully work 
with each other.

The problem comes when I initiate my hostfile to attempt to communicate across 
machines. The hostfile is set up correctly with   . 
When running with all verbose options enabled "mpirun --mca plm_base_verbose 99 
--debug-daemons --mca btl_base_verbose 30 --mca oob_base_verbose 99 --mca 
pml_base_verbose 99 -hostfile hostfile -np 16 hello_c" I receive the following 
text output.

[machine1:03578] mca: base: components_open: Looking for plm components
[machine1:03578] mca: base: components_open: opening plm components
[machine1:03578] mca: base: components_open: found loaded component rsh
[machine1:03578] mca: base: components_open: component rsh has no register 
function
[machine1:03578] mca: base: components_open: component rsh open function 
successful
[machine1:03578] mca: base: components_open: found loaded component slurm
[machine1:03578] mca: base: components_open: component slurm has no register 
function
[machine1:03578] mca: base: components_open: component slurm open function 
successful
[machine1:03578] mca:base:select: Auto-selecting plm components
[machine1:03578] mca:base:select:(  plm) Querying component [rsh]
[machine1:03578] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[machine1:03578] mca:base:select:(  plm) Querying component [slurm]
[machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[machine1:03578] mca:base:select:(  plm) Selected component [rsh]
[machine1:03578] mca: base: close: component slurm closed
[machine1:03578] mca: base: close: unloading component slurm
[machine1:03578] mca: base: components_open: Looking for oob components
[machine1:03578] mca: base: components_open: opening oob components
[machine1:03578] mca: base: components_open: found loaded component tcp
[machine1:03578] mca: base: components_open: component tcp has no register 
function
[machine1:03578] mca: base: components_open: component tcp open function 
successful
Daemon was launched on machine2- beginning to initialize
[machine2:01962] mca: base: components_open: Looking for oob components
[machine2:01962] mca: base: components_open: opening oob components
[machine2:01962] mca: base: components_open: found loaded component tcp
[machine2:01962] mca: base: components_open: component tcp has no register 
function
[machine2:01962] mca: base: components_open: component tcp open function 
successful
Daemon [[1418,0],1] checking in as pid 1962 on host machine2
Daemon [[1418,0],1] not using static ports

At this point the system hangs indefinitely. While running top on the machine2 
terminal, I see several things come up briefly. These items are: sshd (root), 
tcsh (myuser), orted (myuser), and mcstransd (root). I was wondering if sshd 
needs to be initiated by myuser? It is currently turned off in sshd_config 
through UsePAM yes. This was setup by the sysadmin but it can be worked around 
if this is necessary.

So in summary, mpirun works on each machine individually, but hangs when 
initiated through a hostfile or with the -host flag. ./configure with defaults 
and --prefix. LD_LIBRARY_PATH and PATH set up correctly. Any help is 
appreciated. Thanks!
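
For completeness, the "defaults plus --prefix" build and the environment setup 
described above amount to something like the following; the prefix path is a 
placeholder, and the export lines assume a Bourne-style shell (the tcsh seen in top 
would use setenv instead):

    ./configure --prefix=/opt/openmpi-1.4.2
    make all install
    export PATH=/opt/openmpi-1.4.2/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi-1.4.2/lib:$LD_LIBRARY_PATH

The two environment settings need to be in place for non-interactive logins on every 
node, since the remote orted daemons must find the same binaries and libraries.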





Re: [OMPI users] OpenMPI Hangs, No Error

2010-07-06 Thread Ralph Castain
It looks like the remote daemon is starting - is there a firewall in the way?

On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:

> Hello all,
> 
> I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons and right 
> now I am just working on getting OpenMPI itself up and running. I have a 
> successful configure and make all install. LD_LIBRARY_PATH and PATH variables 
> were correctly edited. mpirun -np 8 hello_c successfully works on all 
> machines. I have set up my two test machines with DSA key pairs that 
> successfully work with each other.
> 
> The problem comes when I initiate my hostfile to attempt to communicate 
> across machines. The hostfile is set up correctly with   
> . When running with all verbose options enabled "mpirun --mca 
> plm_base_verbose 99 --debug-daemons --mca btl_base_verbose 30 --mca 
> oob_base_verbose 99 --mca pml_base_verbose 99 -hostfile hostfile -np 16 
> hello_c" I receive the following text output.
> 
> [machine1:03578] mca: base: components_open: Looking for plm components
> [machine1:03578] mca: base: components_open: opening plm components
> [machine1:03578] mca: base: components_open: found loaded component rsh
> [machine1:03578] mca: base: components_open: component rsh has no register 
> function
> [machine1:03578] mca: base: components_open: component rsh open function 
> successful
> [machine1:03578] mca: base: components_open: found loaded component slurm
> [machine1:03578] mca: base: components_open: component slurm has no register 
> function
> [machine1:03578] mca: base: components_open: component slurm open function 
> successful
> [machine1:03578] mca:base:select: Auto-selecting plm components
> [machine1:03578] mca:base:select:(  plm) Querying component [rsh]
> [machine1:03578] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [machine1:03578] mca:base:select:(  plm) Querying component [slurm]
> [machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
> failed to return a module
> [machine1:03578] mca:base:select:(  plm) Selected component [rsh]
> [machine1:03578] mca: base: close: component slurm closed
> [machine1:03578] mca: base: close: unloading component slurm
> [machine1:03578] mca: base: components_open: Looking for oob components
> [machine1:03578] mca: base: components_open: opening oob components
> [machine1:03578] mca: base: components_open: found loaded component tcp
> [machine1:03578] mca: base: components_open: component tcp has no register 
> function
> [machine1:03578] mca: base: components_open: component tcp open function 
> successful
> Daemon was launched on machine2- beginning to initialize
> [machine2:01962] mca: base: components_open: Looking for oob components
> [machine2:01962] mca: base: components_open: opening oob components
> [machine2:01962] mca: base: components_open: found loaded component tcp
> [machine2:01962] mca: base: components_open: component tcp has no register 
> function
> [machine2:01962] mca: base: components_open: component tcp open function 
> successful
> Daemon [[1418,0],1] checking in as pid 1962 on host machine2
> Daemon [[1418,0],1] not using static ports
> 
> At this point the system hangs indefinitely. While running top on the 
> machine2 terminal, I see several things come up briefly. These items are: 
> sshd (root), tcsh (myuser), orted (myuser), and mcstransd (root). I was 
> wondering if sshd needs to be initiated by myuser? It is currently turned off 
> in sshd_config through UsePAM yes. This was setup by the sysadmin but it can 
> be worked around if this is necessary.
> 
> So in summary, mpirun works on each machine individually, but hangs when 
> initiated through a hostfile or with the -host flag. ./configure with 
> defaults and --prefix. LD_LIBRARY_PATH and PATH set up correctly. Any help is 
> appreciated. Thanks!