Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread jody
Hi Ralph

Thank you for your suggestions.
I'll be happy to help you.
I'm not sure if i'll get around to this tomorrow,
but i certainly will do so on Monday.

Thanks
  Jody

On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain  wrote:
> Hi Jody
>
> I'm not sure when I'll get a chance to work on this - got a deadline to meet. 
> I do have a couple of suggestions, if you wouldn't mind helping debug the 
> problem?
>
> It looks to me like the problem is that mpirun is crashing or terminating 
> early for some reason - hence the failures to send msgs to it, and the 
> "lifeline lost" error that leads to the termination of the daemon. If you 
> build a debug version of the code (i.e., --enable-debug on configure), you 
> can get a lot of debug info that traces the behavior.
>
> If you could then run your program with
>
>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>
> and send it to me, we'll see what ORTE thinks it is doing.
>
> You could also take a look at the code for implementing the xterm option. 
> You'll find it in
>
> orte/mca/odls/base/odls_base_default_fns.c
>
> around line 1115. The xterm command syntax is defined in
>
> orte/mca/odls/base/odls_base_open.c
>
> around line 233 and following. Note that we use "xterm -T" as the cmd. 
> Perhaps you can spot an error in the way we treat xterm?
>
> Also, remember that you have to specify that you want us to "hold" the xterm 
> window open even after the process terminates. If you don't specify it, the 
> window automatically closes upon completion of the process. So a fast-running 
> cmd like "hostname" might disappear so quickly that it causes a race 
> condition problem.
>
> You might want to try a spinner application - i.e., output something and 
> then sit in a loop or sleep for some period of time. Or, use the "hold" 
> option to keep the window open - you designate "hold" by putting a '!' before 
> the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>
>
> On Apr 28, 2011, at 8:38 AM, jody wrote:
>
>> Hi
>>
>> Unfortunately this does not solve my problem.
>> While i can do
>>  ssh -Y squid_0 xterm
>> and this will open an xterm on my machine (chefli),
>> i run into problems with the -xterm option of openmpi:
>>
>>  jody@chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>> -Y" -host squid_0 --xterm 1 hostname
>>  squid_0
>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>> lifeline [[35219,0],0] lost
>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>> lifeline [[35219,0],0] lost
>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>> By the way when i look at the DISPLAY variable in the xterm window
>> opened via squid_0,
>> i also have the display variable "localhost:11.0"
>>
>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>> the lines with the warnings about "xauth" and "untrusted X" do not
>> appear:
>>
>>  jody@chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  squid_0
>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>> lifeline [[34926,0],0] lost
>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>> lifeline [[34926,0],0] lost
>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>>
>> I have doubts that the "-Y" is passed correctly:
>>   jody@triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>> -Y" -host squid_0 xterm
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>
>>
>> ---> as a matter of fact i noticed that the xterm option doesn't work 
>> locally:
>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>> prints everything onto the console.
>>
>> Do you have any other suggestions i could try?
>>
>> Thank You
>> Jody
>>
>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain  wrote:
>>> Should be able to just set
>>>
>>> -mca plm_rsh_agent "ssh -Y"
>>>
>>> on your cmd line, I believe
>>>
>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>
 Hi Ralph

 Is there an easy way i could modify the OpenMPI code so that it would use
 the -Y option for ssh when connecting to remote machines?

 Thank You
   Jody

 On Thu, Apr 7, 2011 at 

Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread Ralph Castain
Hi Jody

I'm not sure when I'll get a chance to work on this - got a deadline to meet. I 
do have a couple of suggestions, if you wouldn't mind helping debug the problem?

It looks to me like the problem is that mpirun is crashing or terminating early 
for some reason - hence the failures to send msgs to it, and the "lifeline 
lost" error that leads to the termination of the daemon. If you build a debug 
version of the code (i.e., --enable-debug on configure), you can get a lot of 
debug info that traces the behavior.

If you could then run your program with

 -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached

and send it to me, we'll see what ORTE thinks it is doing.

You could also take a look at the code for implementing the xterm option. 
You'll find it in

orte/mca/odls/base/odls_base_default_fns.c

around line 1115. The xterm command syntax is defined in

orte/mca/odls/base/odls_base_open.c

around line 233 and following. Note that we use "xterm -T" as the cmd. Perhaps 
you can spot an error in the way we treat xterm?

Also, remember that you have to specify that you want us to "hold" the xterm 
window open even after the process terminates. If you don't specify it, the 
window automatically closes upon completion of the process. So a fast-running 
cmd like "hostname" might disappear so quickly that it causes a race condition 
problem.

You might want to try a spinner application - i.e., output something and then 
sit in a loop or sleep for some period of time. Or, use the "hold" option to 
keep the window open - you designate "hold" by putting a '!' before the rank, 
e.g., "mpirun -np 2 -xterm \!2 hostname"
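A spinner of the kind described above can be a two-line script (a sketch; the sleep length is an argument so it is easy to shorten, and OMPI_COMM_WORLD_RANK is the per-rank variable mpirun sets, as seen later in this thread):

```shell
# Create a tiny "spinner" app: print something identifying the rank, then
# sleep so a fast exit can't race the xterm window closing.
cat > spinner.sh <<'EOF'
#!/bin/sh
echo "rank ${OMPI_COMM_WORLD_RANK:-?} on $(hostname)"
sleep "${1:-60}"
EOF
chmod +x spinner.sh
# then e.g.: mpirun -np 2 -xterm 1 ./spinner.sh
```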


On Apr 28, 2011, at 8:38 AM, jody wrote:

> Hi
> 
> Unfortunately this does not solve my problem.
> While i can do
>  ssh -Y squid_0 xterm
> and this will open an xterm on my machine (chefli),
> i run into problems with the -xterm option of openmpi:
> 
>  jody@chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
> -Y" -host squid_0 --xterm 1 hostname
>  squid_0
>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
> lifeline [[35219,0],0] lost
>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
> lifeline [[35219,0],0] lost
>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
> 
> By the way when i look at the DISPLAY variable in the xterm window
> opened via squid_0,
> i also have the display variable "localhost:11.0"
> 
> Actually, the difference with using the "-mca plm_rsh_agent" is that
> the lines with the warnings about "xauth" and "untrusted X" do not
> appear:
> 
>  jody@chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>  squid_0
>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
> lifeline [[34926,0],0] lost
>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
> lifeline [[34926,0],0] lost
>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
> 
> 
> I have doubts that the "-Y" is passed correctly:
>   jody@triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
> -Y" -host squid_0 xterm
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
> 
> 
> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>  mpirun -np 4 -xterm 1 /usr/bin/printenv
> prints everything onto the console.
> 
> Do you have any other suggestions i could try?
> 
> Thank You
> Jody
> 
> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain  wrote:
>> Should be able to just set
>> 
>> -mca plm_rsh_agent "ssh -Y"
>> 
>> on your cmd line, I believe
>> 
>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>> 
>>> Hi Ralph
>>> 
>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>> the -Y option for ssh when connecting to remote machines?
>>> 
>>> Thank You
>>>   Jody
>>> 
>>> On Thu, Apr 7, 2011 at 4:01 PM, jody  wrote:
 Hi Ralph
 thank you for your suggestions. After some fiddling, i found that after my
 last update (gentoo) my sshd_config had been overwritten
 (X11Forwarding was set to 'no').
 
 After correcting that, i can now open remote terminals with 'ssh -Y'
 and with 'ssh -X'
 (but with '-X' i still get those xauth warnings)
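Since a distro update silently reset X11Forwarding here, a small guard can catch the regression (a sketch; the function name is mine, and the path is passed in so it is easy to test — on a real node it would be /etc/ssh/sshd_config):

```shell
# Check that an sshd_config enables X11 forwarding -- the setting the
# gentoo update had silently reset to 'no'.
check_x11_forwarding() {
    grep -Eiq '^[[:space:]]*X11Forwarding[[:space:]]+yes' "$1"
}
# usage: check_x11_forwarding /etc/ssh/sshd_config && echo "X11 forwarding on"
```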

Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Sindhi, Waris PW
Do you know when this fix is slated for an official release ?  


Sincerely,

Waris Sindhi
High Performance Computing, TechApps
Pratt & Whitney, UTC
(860)-565-8486

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
Sent: Thursday, April 28, 2011 9:03 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded


On Apr 28, 2011, at 6:56 AM, Sindhi, Waris PW wrote:

> Yes the procgroup file has more than 128 applications in it.
> 
> % wc -l procgroup
> 239 procgroup 
> 
> Is 128 the max applications that can be in a procgroup file ? 

Yep - this limitation is lifted in the developer's trunk, but not yet in
a release.


> 
> Sincerely,
> 
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On
> Behalf Of Ralph Castain
> Sent: Wednesday, April 27, 2011 8:09 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
> 
> 
> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
> 
>> No we do not have a firewall turned on. I can run smaller 96 slave
> cases
>> on ln10 and ln13 included on the slavelist. 
>> 
>> Could there be another reason for this to fail ? 
> 
> What is in "procgroup"? Is it a single application?
> 
> Offhand, there is nothing in OMPI that would explain the problem. The
> only possibility I can think of would be if your "procgroup" file
> contains more than 128 applications in it.
> 
>> 
>> 
>> Sincerely,
>> 
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>> 
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On
>> Behalf Of Ralph Castain
>> Sent: Wednesday, April 27, 2011 2:18 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>> 
>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>> establish TCP communications with the daemon on ln10.
>> 
>> 
>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>> 
>>> Hi,
>>>   I am getting a "oob-tcp: Communication retries exceeded" error
>>> message when I run a 238 MPI slave code
>>> 
>>> 
>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl
^tcp
>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>> 
>> 
>

>>> --
>>> mpirun was unable to start the specified application as it
> encountered
>>> an error:
>>> 
>>> Error name: Unknown error: 1
>>> Node: ln10
>>> 
>>> when attempting to start process rank 234.
>>> 
>> 
>

>>> --
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>>> orted/orted_comm.c at line 130
>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>>> orted/orted_comm.c at line 130
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> 
>>> Any help would be greatly appreciated.
>>> 
>>> Sincerely,
>>> 
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 

Re: [OMPI users] srun and openmpi

2011-04-28 Thread Michael Di Domenico
On Thu, Apr 28, 2011 at 9:03 AM, Ralph Castain  wrote:
>
> On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
>>>
>>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>>>
 On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>>>
>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>>
 Was this ever committed to the OMPI src as something not having to be
 run outside of OpenMPI, but as part of the PSM setup that OpenMPI
 does?
>>>
>>> Not that I know of - I don't think the PSM developers ever looked at it.
>>>
>>> Thought about this some more and I believe I have a soln to the problem. 
>>> Will try to commit something to the devel trunk by the end of the week.
>>
>> Thanks
>
> Just to save me looking back thru the thread - what OMPI version are you 
> using? If it isn't the trunk, I'll send you a patch you can use.

I'm using OpenMPI v1.5.3 currently


Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Ralph Castain
We figured out that in the case where you provide the full path to mpirun -and- 
the -prefix option, we ignore the latter anyway. :-/

I'm working on a patch to at least warn you we are ignoring it.


On Apr 28, 2011, at 2:03 PM, Sindhi, Waris PW wrote:

> The --prefix directory is a typo and no longer exists on our system. 
> 
> We are running 1.4-4 version of OpenMPI
> 
> % /opt/openmpi/x86_64/bin/ompi_info
> 
> Package: Open MPI
> mockbu...@x86-004.build.bos.redhat.com Distribution
>Open MPI: 1.4
>   Open MPI SVN revision: r22285
>   Open MPI release date: Dec 08, 2009
>Open RTE: 1.4
> 
> 
> Sincerely,
> 
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Thursday, April 28, 2011 9:02 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
> 
> 
> On Apr 28, 2011, at 6:49 AM, Jeff Squyres wrote:
> 
>> On Apr 28, 2011, at 8:45 AM, Ralph Castain wrote:
>> 
>>> What led you to conclude 1.2.8?
>>> 
>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl
> ^tcp
>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>> 
>> His command line has "1.2.8" in it.
> 
> Actually, that isn't totally correct and may point to the problem. The
> mpirun cmd itself points to a version of OMPI located in /opt/openmpi.
> The error messages are clearly from a 1.3+ version - they look totally
> different for 1.2
> 
> However, the prefix passed to the backend nodes points to /usr/lib, and
> indeed looks like a 1.2.8 version.
> 
> Waris: is this a typo? Are these two versions actually the same?
> 
> If not, that would explain the problem - you can't mix OMPI versions. As
> written, the cmd line has the potential to mix one version of mpirun
> with another version of the daemons.
> 
> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Sindhi, Waris PW
The --prefix directory is a typo and no longer exists on our system. 

We are running 1.4-4 version of OpenMPI

% /opt/openmpi/x86_64/bin/ompi_info

 Package: Open MPI
mockbu...@x86-004.build.bos.redhat.com Distribution
Open MPI: 1.4
   Open MPI SVN revision: r22285
   Open MPI release date: Dec 08, 2009
Open RTE: 1.4


Sincerely,

Waris Sindhi
High Performance Computing, TechApps
Pratt & Whitney, UTC
(860)-565-8486

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
Sent: Thursday, April 28, 2011 9:02 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded


On Apr 28, 2011, at 6:49 AM, Jeff Squyres wrote:

> On Apr 28, 2011, at 8:45 AM, Ralph Castain wrote:
> 
>> What led you to conclude 1.2.8?
>> 
>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl
^tcp
>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
> 
> His command line has "1.2.8" in it.

Actually, that isn't totally correct and may point to the problem. The
mpirun cmd itself points to a version of OMPI located in /opt/openmpi.
The error messages are clearly from a 1.3+ version - they look totally
different for 1.2

However, the prefix passed to the backend nodes points to /usr/lib, and
indeed looks like a 1.2.8 version.

Waris: is this a typo? Are these two versions actually the same?

If not, that would explain the problem - you can't mix OMPI versions. As
written, the cmd line has the potential to mix one version of mpirun
with another version of the daemons.
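One way to spot such a mix is to pull the version out of ompi_info output on each node and compare. A sketch (the parsing just keys off the "Open MPI: 1.4" line format ompi_info prints, as shown earlier in the thread; the ssh example in the comment is illustrative):

```shell
# Extract the version from ompi_info output so the front-end and back-end
# installs can be compared for a mismatch.
ompi_version() {
    "$@" | awk -F': *' '/Open MPI:/ {print $2; exit}'
}
# e.g. local:  ompi_version /opt/openmpi/x86_64/bin/ompi_info
#      remote: ssh ln10 ompi_info | awk -F': *' '/Open MPI:/ {print $2; exit}'
```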


> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread Ralph Castain
No immediate suggestions - I won't get a chance to test this until later as I 
don't normally run an x11 server on my box, and don't have another way to test 
it.


On Apr 28, 2011, at 8:38 AM, jody wrote:

> Hi
> 
> Unfortunately this does not solve my problem.
> While i can do
>  ssh -Y squid_0 xterm
> and this will open an xterm on my machine (chefli),
> i run into problems with the -xterm option of openmpi:
> 
>  jody@chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
> -Y" -host squid_0 --xterm 1 hostname
>  squid_0
>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
> lifeline [[35219,0],0] lost
>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
> lifeline [[35219,0],0] lost
>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
> 
> By the way when i look at the DISPLAY variable in the xterm window
> opened via squid_0,
> i also have the display variable "localhost:11.0"
> 
> Actually, the difference with using the "-mca plm_rsh_agent" is that
> the lines with the warnings about "xauth" and "untrusted X" do not
> appear:
> 
>  jody@chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>  squid_0
>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
> lifeline [[34926,0],0] lost
>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
> lifeline [[34926,0],0] lost
>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
> 
> 
> I have doubts that the "-Y" is passed correctly:
>   jody@triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
> -Y" -host squid_0 xterm
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
> 
> 
> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>  mpirun -np 4 -xterm 1 /usr/bin/printenv
> prints everything onto the console.
> 
> Do you have any other suggestions i could try?
> 
> Thank You
> Jody
> 
> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain  wrote:
>> Should be able to just set
>> 
>> -mca plm_rsh_agent "ssh -Y"
>> 
>> on your cmd line, I believe
>> 
>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>> 
>>> Hi Ralph
>>> 
>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>> the -Y option for ssh when connecting to remote machines?
>>> 
>>> Thank You
>>>   Jody
>>> 
>>> On Thu, Apr 7, 2011 at 4:01 PM, jody  wrote:
 Hi Ralph
 thank you for your suggestions. After some fiddling, i found that after my
 last update (gentoo) my sshd_config had been overwritten
 (X11Forwarding was set to 'no').
 
 After correcting that, i can now open remote terminals with 'ssh -Y'
 and with 'ssh -X'
 (but with '-X' i still get those xauth warnings)
 
 But the xterm option still doesn't work:
  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
 printenv | grep WORLD_RANK
  Warning: untrusted X11 forwarding setup failed: xauth key data not 
 generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
  OMPI_COMM_WORLD_RANK=0
  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
 mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
 [sd = 8]
  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
 lifeline [[54132,0],0] lost
 
 So it looks like the two processes from squid_0 can't open the display 
 this way,
 but one of them writes the output to the console...
 Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' 
 the
 DISPLAY variable is set to 'localhost:10.0'
 
 So in what way would OMPI have to be adapted, so -xterm would work?
 
 Thank You
  Jody
 
 On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain  wrote:
> Here's a little more info - it's for Cygwin, but I don't see anything
> Cygwin-specific in the answers:
> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
> 
> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
> 
> Sorry Jody - 

Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain
Per earlier in the thread, it looks like you are using a 1.5 series release - 
so here is a patch that -should- fix the PSM setup problem.

Please let me know if/how it works as I honestly have no way of testing it.
Ralph



slurmd.diff
Description: Binary data


On Apr 28, 2011, at 7:03 AM, Ralph Castain wrote:

> 
> On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:
> 
>> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
>>> 
>>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>>> 
 On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
> 
> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
> 
>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>>> 
>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>> 
 Was this ever committed to the OMPI src as something not having to be
 run outside of OpenMPI, but as part of the PSM setup that OpenMPI
 does?
>>> 
>>> Not that I know of - I don't think the PSM developers ever looked at it.
>>> 
>>> Thought about this some more and I believe I have a soln to the problem. 
>>> Will try to commit something to the devel trunk by the end of the week.
>> 
>> Thanks
> 
> Just to save me looking back thru the thread - what OMPI version are you 
> using? If it isn't the trunk, I'll send you a patch you can use.
> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 



Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread jody
Hi

Unfortunately this does not solve my problem.
While i can do
  ssh -Y squid_0 xterm
and this will open an xterm on my machine (chefli),
i run into problems with the -xterm option of openmpi:

  jody@chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
-Y" -host squid_0 --xterm 1 hostname
  squid_0
  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
lifeline [[35219,0],0] lost
  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
lifeline [[35219,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0

By the way when i look at the DISPLAY variable in the xterm window
opened via squid_0,
i also have the display variable "localhost:11.0"

Actually, the difference with using the "-mca plm_rsh_agent" is that
the lines with the warnings about "xauth" and "untrusted X" do not
appear:

  jody@chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  squid_0
  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
lifeline [[34926,0],0] lost
  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
lifeline [[34926,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0


I have doubts that the "-Y" is passed correctly:
   jody@triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
-Y" -host squid_0 xterm
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set


---> as a matter of fact i noticed that the xterm option doesn't work locally:
  mpirun -np 4 -xterm 1 /usr/bin/printenv
prints everything onto the console.

Do you have any other suggestions i could try?

Thank You
 Jody

On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain  wrote:
> Should be able to just set
>
> -mca plm_rsh_agent "ssh -Y"
>
> on your cmd line, I believe
>
> On Apr 28, 2011, at 12:53 AM, jody wrote:
>
>> Hi Ralph
>>
>> Is there an easy way i could modify the OpenMPI code so that it would use
>> the -Y option for ssh when connecting to remote machines?
>>
>> Thank You
>>   Jody
>>
>> On Thu, Apr 7, 2011 at 4:01 PM, jody  wrote:
>>> Hi Ralph
>>> thank you for your suggestions. After some fiddling, i found that after my
>>> last update (gentoo) my sshd_config had been overwritten
>>> (X11Forwarding was set to 'no').
>>>
>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>> and with 'ssh -X'
>>> (but with '-X' i still get those xauth warnings)
>>>
>>> But the xterm option still doesn't work:
>>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>> printenv | grep WORLD_RANK
>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>> generated
>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>  OMPI_COMM_WORLD_RANK=0
>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>> lifeline [[54132,0],0] lost
>>>
>>> So it looks like the two processes from squid_0 can't open the display this 
>>> way,
>>> but one of them writes the output to the console...
>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' 
>>> the
>>> DISPLAY variable is set to 'localhost:10.0'
>>>
>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>
>>> Thank You
>>>  Jody
>>>
>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain  wrote:
 Here's a little more info - it's for Cygwin, but I don't see anything
 Cygwin-specific in the answers:
 http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding

 On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:

 Sorry Jody - I should have read your note more carefully to see that you
 already tried -Y. :-(
 Not sure what to suggest...

 On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:

 Like I said, I'm not expert. However, a quick "google" of revealed this
 result:

 When trying to set up x11 forwarding over an ssh session to a remote server
 with the -X switch, I was 

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-28 Thread Brock Palen
Attached is the output of running with verbose 100, mpirun --mca 
btl_openib_cpc_include rdmacm --mca btl_base_verbose 100 NPmpi
[nyx0665.engin.umich.edu:06399] mca: base: components_open: Looking for btl 
components
[nyx0666.engin.umich.edu:07210] mca: base: components_open: Looking for btl 
components
[nyx0665.engin.umich.edu:06399] mca: base: components_open: opening btl 
components
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component ofud
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component ofud has 
no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component ofud open 
function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component openib
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component openib 
has no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component openib 
open function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component self
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component self has 
no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component self open 
function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component sm
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component sm has no 
register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component sm open 
function successful
[nyx0665.engin.umich.edu:06399] mca: base: components_open: found loaded 
component tcp
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component tcp has 
no register function
[nyx0665.engin.umich.edu:06399] mca: base: components_open: component tcp open 
function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: opening btl 
components
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component ofud
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component ofud has 
no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component ofud open 
function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component openib
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component openib 
has no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component openib 
open function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component self
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component self has 
no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component self open 
function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component sm
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component sm has no 
register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component sm open 
function successful
[nyx0666.engin.umich.edu:07210] mca: base: components_open: found loaded 
component tcp
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component tcp has 
no register function
[nyx0666.engin.umich.edu:07210] mca: base: components_open: component tcp open 
function successful
[nyx0665.engin.umich.edu:06399] select: initializing btl component ofud
[nyx0665.engin.umich.edu:06399] select: init of component ofud returned failure
[nyx0665.engin.umich.edu:06399] select: module ofud unloaded
[nyx0665.engin.umich.edu:06399] select: initializing btl component openib
[nyx0666.engin.umich.edu:07210] select: initializing btl component ofud
[nyx0666.engin.umich.edu:07210] select: init of component ofud returned failure
[nyx0666.engin.umich.edu:07210] select: module ofud unloaded
[nyx0666.engin.umich.edu:07210] select: initializing btl component openib
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm CPC unavailable for use on 
mthca0:1; skipped
--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   nyx0665.engin.umich.edu
  Local device: mthca0
  Local port:   1
  CPCs attempted:   rdmacm
--
[nyx0665.engin.umich.edu:06399] select: init of component openib returned 
failure
[nyx0665.engin.umich.edu:06399] select: module openib unloaded
[nyx0665.engin.umich.edu:06399] select: initializing btl component self
[nyx0665.engin.umich.edu:06399] select: init of component self returned success
[nyx0665.engin.umich.edu:06399] select: initializing btl component sm
[nyx0665.engin.umich.edu:06399] select: 

Re: [OMPI users] --enable-progress-threads broken in 1.5.3?

2011-04-28 Thread Eugene Loh

CMR 2728 did this.  I think the changes are in 1.5.4.

On 4/28/2011 5:00 AM, Jeff Squyres wrote:

It is quite likely that --enable-progress-threads is broken.  I think it's even 
disabled in 1.4.x; I wonder if we should do the same in 1.5.x...


Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread Ralph Castain
Should be able to just set

-mca plm_rsh_agent "ssh -Y"

on your cmd line, I believe
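[Editor's note] The same agent can be set without editing the command line each time — a sketch, assuming a 1.3/1.4-era Open MPI where the rsh launcher reads `plm_rsh_agent` (check `ompi_info` for the exact parameter name in your version):

```shell
# Environment-variable form (any MCA parameter can be set as OMPI_MCA_<name>):
export OMPI_MCA_plm_rsh_agent="ssh -Y"

# Or persistently: put this line (uncommented) in $HOME/.openmpi/mca-params.conf
#   plm_rsh_agent = ssh -Y
```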

On Apr 28, 2011, at 12:53 AM, jody wrote:

> Hi Ralph
> 
> Is there an easy way i could modify the OpenMPI code so that it would use
> the -Y option for ssh when connecting to remote machines?
> 
> Thank You
>   Jody
> 
> On Thu, Apr 7, 2011 at 4:01 PM, jody  wrote:
>> Hi Ralph
>> thank you for your suggestions. After some fiddling, i found that after my
>> last update (gentoo) my sshd_config had been overwritten
>> (X11Forwarding was set to 'no').
>> 
>> After correcting that, i can now open remote terminals with 'ssh -Y'
>> and with 'ssh -X'
>> (but with '-X' i still get those xauth warnings)
>> 
>> But the xterm option still doesn't work:
>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>> printenv | grep WORLD_RANK
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>  OMPI_COMM_WORLD_RANK=0
>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>> lifeline [[54132,0],0] lost
>> 
>> So it looks like the two processes from squid_0 can't open the display this 
>> way,
>> but one of them writes the output to the console...
>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' 
>> the
>> DISPLAY variable is set to 'localhost:10.0'
>> 
>> So in what way would OMPI have to be adapted, so -xterm would work?
>> 
>> Thank You
>>  Jody
>> 
>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain  wrote:
>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>> Cygwin-specific in the answers:
>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>> 
>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>> 
>>> Sorry Jody - I should have read your note more carefully to see that you
>>> already tried -Y. :-(
>>> Not sure what to suggest...
>>> 
>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>> 
>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>> result:
>>> 
>>> When trying to set up x11 forwarding over an ssh session to a remote server
>>> with the -X switch, I was getting an error like Warning: No xauth
>>> data; using fake authentication data for X11 forwarding.
>>> 
>>> When doing something like:
>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>> got an error message like:
>>> 
>>> 
>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>> [root@RHEL ~]#
>>> and any X programs I ran would not display on my local system..
>>> 
>>> Turns out the solution is to use the -Y switch instead.
>>> 
>>> ssh -Yl root 10.1.1.9
>>> 
>>> and that worked fine.
>>> 
>>> See if that works for you - if it does, we may have to modify OMPI to
>>> accommodate.
>>> 
>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>> 
>>> Hi Ralph
>>> No, after the above error message mpirun has exited.
>>> 
>>> But i also noticed that it is possible to ssh into squid_0 and open an xterm there:
>>> 
>>>  jody@chefli ~/share/neander $ ssh -Y squid_0
>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>  jody@squid_0 ~ $ xterm
>>>  xterm Xt error: Can't open display:
>>>  xterm:  DISPLAY is not set
>>>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>  jody@squid_0 ~ $ xterm
>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>  jody@squid_0 ~ $ xterm
>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>  jody@squid_0 ~ $ exit
>>>  logout
>>> 
>>> same thing with ssh -X, but here i get the same warning/error message
>>> as with mpirun:
>>> 
>>>  jody@chefli ~/share/neander $ ssh -X squid_0
>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>> generated
>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>> 
>>> So perhaps the whole problem is linked to that xauth-thing.
>>> Do you have a suggestion how this can be solved?
>>> 
>>> Thank You
>>>  Jody
>>> 
>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain  wrote:
>>> 
>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>> and hence will abort.
>>> 
>>> Are you seeing mpirun still alive?
>>> 
>>> 
>>> On Apr 5, 

Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain

On Apr 28, 2011, at 6:49 AM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>> 
>>> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
 
 On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
 
> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>> 
>>> Was this ever committed to the OMPI src as something not having to be
>>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>>> does?
>> 
>> Not that I know of - I don't think the PSM developers ever looked at it.
>> 
>> Thought about this some more and I believe I have a soln to the problem. 
>> Will try to commit something to the devel trunk by the end of the week.
> 
> Thanks

Just to save me looking back thru the thread - what OMPI version are you using? 
If it isn't the trunk, I'll send you a patch you can use.

> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Ralph Castain

On Apr 28, 2011, at 6:56 AM, Sindhi, Waris PW wrote:

> Yes the procgroup file has more than 128 applications in it.
> 
> % wc -l procgroup
> 239 procgroup 
> 
> Is 128 the max applications that can be in a procgroup file ? 

Yep - this limitation is lifted in the developer's trunk, but not yet in a 
release.
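[Editor's note] The limit above is easy to check for before launching. A small sketch (plain Python; the "one application per non-blank, non-comment line" parsing is an assumption about the appfile format, not taken from the Open MPI source):

```python
# Sketch: warn before launching if an MPMD appfile ("procgroup") holds more
# applications than released Open MPI versions accept. MAX_APPS = 128 comes
# from the discussion above; the parsing rule (one application per non-blank,
# non-comment line) is an assumption about the file format.
MAX_APPS = 128

def count_apps(lines):
    return sum(1 for ln in lines
               if ln.strip() and not ln.lstrip().startswith("#"))

apps = ["-np 1 ./master"] + ["-np 1 ./slave"] * 238
n = count_apps(apps)
print(n, "applications;", "over the limit" if n > MAX_APPS else "ok")
# 239 applications; over the limit
```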


> 
> Sincerely,
> 
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, April 27, 2011 8:09 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
> 
> 
> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
> 
>> No we do not have a firewall turned on. I can run smaller 96 slave
> cases
>> on ln10 and ln13 included on the slavelist. 
>> 
>> Could there be another reason for this to fail ? 
> 
> What is in "procgroup"? Is it a single application?
> 
> Offhand, there is nothing in OMPI that would explain the problem. The
> only possibility I can think of would be if your "procgroup" file
> contains more than 128 applications in it.
> 
>> 
>> 
>> Sincerely,
>> 
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>> 
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On
>> Behalf Of Ralph Castain
>> Sent: Wednesday, April 27, 2011 2:18 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>> 
>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>> establish TCP communications with the daemon on ln10.
>> 
>> 
>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>> 
>>> Hi,
>>>   I am getting a "oob-tcp: Communication retries exceeded" error
>>> message when I run a 238 MPI slave code
>>> 
>>> 
>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>> 
>> 
> 
>>> --
>>> mpirun was unable to start the specified application as it
> encountered
>>> an error:
>>> 
>>> Error name: Unknown error: 1
>>> Node: ln10
>>> 
>>> when attempting to start process rank 234.
>>> 
>> 
> 
>>> --
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>>> orted/orted_comm.c at line 130
>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>>> orted/orted_comm.c at line 130
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>> retries
>>> exceeded.  Can not communicate with peer
>>> 
>>> Any help would be greatly appreciated.
>>> 
>>> Sincerely,
>>> 
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>> 
>>> 

Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Ralph Castain

On Apr 28, 2011, at 6:49 AM, Jeff Squyres wrote:

> On Apr 28, 2011, at 8:45 AM, Ralph Castain wrote:
> 
>> What led you to conclude 1.2.8?
>> 
>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
> 
> His command line has "1.2.8" in it.

Actually, that isn't totally correct and may point to the problem. The mpirun 
cmd itself points to a version of OMPI located in /opt/openmpi. The error 
messages are clearly from a 1.3+ version - they look totally different in 1.2.

However, the prefix passed to the backend nodes points to /usr/lib, and indeed 
looks like a 1.2.8 version.

Waris: is this a mistype? Are these two versions actually the same?

If not, that would explain the problem - you can't mix OMPI versions. As 
written, the cmd line has the potential to mix one version of mpirun with 
another version of the daemons.


> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 




Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Sindhi, Waris PW
Yes the procgroup file has more than 128 applications in it.

% wc -l procgroup
239 procgroup 

Is 128 the max applications that can be in a procgroup file ? 

Sincerely,

Waris Sindhi
High Performance Computing, TechApps
Pratt & Whitney, UTC
(860)-565-8486

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
Sent: Wednesday, April 27, 2011 8:09 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded


On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:

> No we do not have a firewall turned on. I can run smaller 96 slave
cases
> on ln10 and ln13 included on the slavelist. 
> 
> Could there be another reason for this to fail ? 

What is in "procgroup"? Is it a single application?

Offhand, there is nothing in OMPI that would explain the problem. The
only possibility I can think of would be if your "procgroup" file
contains more than 128 applications in it.

> 
> 
> Sincerely,
> 
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On
> Behalf Of Ralph Castain
> Sent: Wednesday, April 27, 2011 2:18 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
> 
> Perhaps a firewall? All it is telling you is that mpirun couldn't
> establish TCP communications with the daemon on ln10.
> 
> 
> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
> 
>> Hi,
>>I am getting a "oob-tcp: Communication retries exceeded" error
>> message when I run a 238 MPI slave code
>> 
>> 
>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>> 
>

>> --
>> mpirun was unable to start the specified application as it
encountered
>> an error:
>> 
>> Error name: Unknown error: 1
>> Node: ln10
>> 
>> when attempting to start process rank 234.
>> 
>

>> --
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>> orted/orted_comm.c at line 130
>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>> orted/orted_comm.c at line 130
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
> retries
>> exceeded.  Can not communicate with peer
>> 
>> Any help would be greatly appreciated.
>> 
>> Sincerely,
>> 
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>> 
>> 



Re: [OMPI users] srun and openmpi

2011-04-28 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 11:47 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>>>
>>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>>>
 On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>
>> Was this ever committed to the OMPI src as something not having to be
>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>> does?
>
> Not that I know of - I don't think the PSM developers ever looked at it.
>
> Thought about this some more and I believe I have a soln to the problem. Will 
> try to commit something to the devel trunk by the end of the week.

Thanks


Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Jeff Squyres
On Apr 28, 2011, at 8:45 AM, Ralph Castain wrote:

> What led you to conclude 1.2.8?
> 
> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup

His command line has "1.2.8" in it.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Ralph Castain

On Apr 28, 2011, at 6:04 AM, Jeff Squyres wrote:

> I do note that you are using an ancient version of Open MPI (1.2.8).

I don't think that is accurate - at least, the output doesn't match that old a 
version. The process name format is indicative of something 1.3 or more recent.
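[Editor's note] The bracketed names in the error output ("[[61748,0],0]") can be picked apart mechanically — a sketch in Python; the field labels (job family, local jobid, vpid) are descriptive assumptions rather than official ORTE terminology:

```python
import re

# Sketch: pick apart an ORTE process name like "[[61748,0],0]" as printed in
# the error output. This bracketed form appeared with Open MPI 1.3; 1.2.x
# printed process names differently, which is the basis of the version
# argument above. The field labels are descriptive assumptions.
NAME_RE = re.compile(r"\[\[(\d+),(\d+)\],(\d+)\]")

def parse_orte_name(text):
    """Return (job_family, local_jobid, vpid) or None if no name is found."""
    m = NAME_RE.search(text)
    return tuple(int(g) for g in m.groups()) if m else None

print(parse_orte_name("[ln13:27867] [[61748,0],0]"))  # (61748, 0, 0)
```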

What led you to conclude 1.2.8?


>  Is there any way you can upgrade to a (much) later version, such as 1.4.3?  
> That might improve your TCP connectivity -- we made improvements in those 
> portions of the code over the years.
> 
> On Apr 27, 2011, at 8:09 PM, Ralph Castain wrote:
> 
>> 
>> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
>> 
>>> No we do not have a firewall turned on. I can run smaller 96 slave cases
>>> on ln10 and ln13 included on the slavelist. 
>>> 
>>> Could there be another reason for this to fail ? 
>> 
>> What is in "procgroup"? Is it a single application?
>> 
>> Offhand, there is nothing in OMPI that would explain the problem. The only 
>> possibility I can think of would be if your "procgroup" file contains more 
>> than 128 applications in it.
>> 
>>> 
>>> 
>>> Sincerely,
>>> 
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>> 
>>> -Original Message-
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Ralph Castain
>>> Sent: Wednesday, April 27, 2011 2:18 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>>> 
>>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>>> establish TCP communications with the daemon on ln10.
>>> 
>>> 
>>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>>> 
 Hi,
  I am getting a "oob-tcp: Communication retries exceeded" error
 message when I run a 238 MPI slave code
 
 
 /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
 --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
 /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
 
>>> 
 --
 mpirun was unable to start the specified application as it encountered
 an error:
 
 Error name: Unknown error: 1
 Node: ln10
 
 when attempting to start process rank 234.
 
>>> 
 --
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
 orted/orted_comm.c at line 130
 [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
 orted/orted_comm.c at line 130
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication
>>> retries
 exceeded.  Can not communicate with peer
 
 Any help would be greatly appreciated.
 
 Sincerely,
 
 Waris Sindhi
 High Performance Computing, TechApps
 Pratt & Whitney, UTC
 (860)-565-8486
 
 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> 

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-28 Thread Jeff Squyres
On Apr 27, 2011, at 10:02 AM, Brock Palen wrote:

> Argh, our messed up environment with three generations of infiniband bit us.
> Setting openib_cpc_include to rdmacm causes ib to not be used on our old DDR 
> ib on some of our hosts.  Note that jobs will never run across our old DDR ib 
> and our new QDR stuff where rdmacm does work.

Hmm -- odd.  I use RDMACM on some old DDR (and SDR!) IB hardware and it seems 
to work fine.

Do you have any indication as to why OMPI is refusing to use rdmacm on your 
older hardware, other than "No OF connection schemes reported..."?  Try running 
with --mca btl_base_verbose 100 (beware: it will be a truckload of output).  
Make sure that you have rdmacm support available on those machines, both in 
OMPI and in OFED/the OS.

> I am doing some testing with:
> export OMPI_MCA_btl_openib_cpc_include=rdmacm,oob,xoob
> 
> What I want to know is there a way to tell mpirun to 'dump all resolved mca 
> settings'  Or something similar. 

I'm not quite sure what you're asking here -- do you want to override MCA 
params on specific hosts?
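[Editor's note] On the "dump all resolved mca settings" question, `ompi_info` can report parameter values as Open MPI resolves them — a command sketch; the exact flags vary by release, so treat the option names as assumptions to check against your installed version:

```shell
# List every registered MCA parameter and its current value:
ompi_info --param all all

# Restrict to the openib BTL parameters relevant here:
ompi_info --param btl openib

# Machine-readable output (useful for diffing settings between hosts):
ompi_info --param all all --parsable
```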

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-28 Thread Jeff Squyres
I do note that you are using an ancient version of Open MPI (1.2.8).  Is there 
any way you can upgrade to a (much) later version, such as 1.4.3?  That might 
improve your TCP connectivity -- we made improvements in those portions of the 
code over the years.

On Apr 27, 2011, at 8:09 PM, Ralph Castain wrote:

> 
> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
> 
>> No we do not have a firewall turned on. I can run smaller 96 slave cases
>> on ln10 and ln13 included on the slavelist. 
>> 
>> Could there be another reason for this to fail ? 
> 
> What is in "procgroup"? Is it a single application?
> 
> Offhand, there is nothing in OMPI that would explain the problem. The only 
> possibility I can think of would be if your "procgroup" file contains more 
> than 128 applications in it.
> 
>> 
>> 
>> Sincerely,
>> 
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>> 
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Ralph Castain
>> Sent: Wednesday, April 27, 2011 2:18 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>> 
>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>> establish TCP communications with the daemon on ln10.
>> 
>> 
>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>> 
>>> Hi,
>>>   I am getting a "oob-tcp: Communication retries exceeded" error
>>> message when I run a 238 MPI slave code
>>> 
>>> 
>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>> 
>> 
>>> --
>>> mpirun was unable to start the specified application as it encountered
>>> an error:
>>> 
>>> Error name: Unknown error: 1
>>> Node: ln10
>>> 
>>> when attempting to start process rank 234.
>>> 
>> 
>>> --
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>>> orted/orted_comm.c at line 130
>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>>> orted/orted_comm.c at line 130
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>> exceeded.  Can not communicate with peer
>>> 
>>> Any help would be greatly appreciated.
>>> 
>>> Sincerely,
>>> 
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_Comm_create prevents external socket connections

2011-04-28 Thread Jeff Squyres
MPI_Comm_create shouldn't have any effect on existing fd's.

Have you run your code through a memory-checking debugger such as valgrind?


On Apr 28, 2011, at 12:57 AM, Randolph Pullen wrote:

> I have a problem with MPI_Comm_create,
> 
> My server application has 2 processes per node, 1 listener and 1 worker.
> 
> Each listener monitors a specified port for incoming TCP connections with the 
> goal that on receipt of a request it will distribute it over the workers in a 
> SIMD fashion.
> 
> This all works fine unless MPI_Comm_create is called on the listener process. 
> Then, after the call, the incoming socket cannot be reached by the external 
> client processes.  The client reports "Couldn't open socket".  No other error 
> is apparent.  I have tried using a variety of different sockets, but to no 
> effect.
> 
> I use OpenMPI 1.4.1 on FD10 with vanilla TCP.  The install is totally 
> standard with no changes.
> 
> Is this a known issue?
> 
> Any help appreciated.
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] --enable-progress-threads broken in 1.5.3?

2011-04-28 Thread Jeff Squyres
It is quite likely that --enable-progress-threads is broken.  I think it's even 
disabled in 1.4.x; I wonder if we should do the same in 1.5.x...


On Apr 28, 2011, at 5:20 AM, Paul Kapinos wrote:

> Hi OpenMPI folks,
> 
> I've tried to install the 1.5.3 version with activated progress threads (just to 
> try it out) in addition to --enable-mpi-threads. The installation was fine, and I 
> could also build binaries, but each mpiexec call hangs forever, silently. With 
> the very same configuration options but without --enable-progress-threads, 
> everything runs fine.
> 
> So I wonder: is --enable-progress-threads broken, or did I perhaps do 
> something wrong?
> 
> 
> The configuration line was:
> 
> ./configure --with-openib --with-lsf --with-devel-headers 
> --enable-contrib-no-build=vt --enable-mpi-threads --enable-progress-threads 
> --enable-heterogeneous --enable-cxx-exceptions 
> --enable-orterun-prefix-by-default <>
> 
> where <> contain prefix and some compiler-specific stuff.
> 
> All versions compiled (GCC, Intel, PGI, Sun Studio compilers, 32-bit and 
> 64-bit) behave the very same way.
> 
> 
> Best wishes,
> 
> Paul
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] --enable-progress-threads broken in 1.5.3?

2011-04-28 Thread Paul Kapinos

Hi OpenMPI folks,

I've tried to install the 1.5.3 version with activated progress threads 
(just to try it out) in addition to --enable-mpi-threads. The 
installation was fine, and I could also build binaries, but each mpiexec 
call hangs forever, silently. With the very same configuration options 
but without --enable-progress-threads, everything runs fine.


So I wonder: is --enable-progress-threads broken, or did I perhaps 
do something wrong?



The configuration line was:

./configure --with-openib --with-lsf --with-devel-headers 
--enable-contrib-no-build=vt --enable-mpi-threads 
--enable-progress-threads --enable-heterogeneous --enable-cxx-exceptions 
--enable-orterun-prefix-by-default <>


where <> contain prefix and some compiler-specific stuff.

All versions compiled (GCC, Intel, PGI, Sun Studio compilers, 32-bit and 
64-bit) behave the very same way.



Best wishes,

Paul


--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread jody
Hi Ralph

Is there an easy way I could modify the OpenMPI code so that it would use
the -Y option for ssh when connecting to remote machines?

Thank You
   Jody

On Thu, Apr 7, 2011 at 4:01 PM, jody  wrote:
> Hi Ralph
> thank you for your suggestions. After some fiddling, I found that after my
> last update (gentoo) my sshd_config had been overwritten
> (X11Forwarding was set to 'no').
>
> After correcting that, I can now open remote terminals with 'ssh -Y'
> and with 'ssh -X'
> (but with '-X' I still get those xauth warnings).
>
> But the xterm option still doesn't work:
>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
> printenv | grep WORLD_RANK
>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>  OMPI_COMM_WORLD_RANK=0
>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
> lifeline [[54132,0],0] lost
>
> So it looks like the two processes from squid_0 can't open the display this 
> way,
> but one of them writes the output to the console...
> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
> DISPLAY variable is set to 'localhost:10.0'
>
> So in what way would OMPI have to be adapted, so -xterm would work?
>
> Thank You
>  Jody
>
> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain  wrote:
>> Here's a little more info - it's for Cygwin, but I don't see anything
>> Cygwin-specific in the answers:
>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>
>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>
>> Sorry Jody - I should have read your note more carefully to see that you
>> already tried -Y. :-(
>> Not sure what to suggest...
>>
>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>
>> Like I said, I'm no expert. However, a quick "google" revealed this
>> result:
>>
>> When trying to set up x11 forwarding over an ssh session to a remote server
>> with the -X switch, I was getting an error like Warning: No xauth
>> data; using fake authentication data for X11 forwarding.
>>
>> When doing something like:
>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>> got an error message like:
>>
>>
>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>> [root@RHEL ~]#
>> and any X programs I ran would not display on my local system..
>>
>> Turns out the solution is to use the -Y switch instead.
>>
>> ssh -Yl root 10.1.1.9
>>
>> and that worked fine.
>>
>> See if that works for you - if it does, we may have to modify OMPI to
>> accommodate.
>>
>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>
>> Hi Ralph
>> No, after the above error message mpirun has exited.
>>
>> But I also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>
>>  jody@chefli ~/share/neander $ ssh -Y squid_0
>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>  jody@squid_0 ~ $ exit
>>  logout
>>
>> same thing with ssh -X, but here i get the same warning/error message
>> as with mpirun:
>>
>>  jody@chefli ~/share/neander $ ssh -X squid_0
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>> generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>
>> So perhaps the whole problem is linked to that xauth-thing.
>> Do you have a suggestion how this can be solved?
>>
>> Thank You
>>  Jody
>>
>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain  wrote:
>>
>> If I read your error messages correctly, it looks like mpirun is crashing -
>> the daemon is complaining that it lost the socket connection back to mpirun,
>> and hence will abort.
>>
>> Are you seeing mpirun still alive?
>>
>>
>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>
>> Hi
>>
>> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
>>
>> it works in "text-mode":
>>
>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>
>>  OMPI_COMM_WORLD_RANK=0
>>
>>  OMPI_COMM_WORLD_RANK=1
>>
>>  

[OMPI users] MPI_Comm_create prevents external socket connections

2011-04-28 Thread Randolph Pullen

I have a problem with MPI_Comm_create.

My server application has 2 processes per node, 1 listener and 1 worker.

Each listener monitors a specified port for incoming TCP connections, with 
the goal that on receipt of a request it will distribute it over the workers 
in a SIMD fashion.

This all works fine unless MPI_Comm_create is called on the listener process.  
Then, after the call, the incoming socket cannot be reached by the external 
client processes.  The client reports "Couldn't open socket".  No other error 
is apparent.  I have tried using a variety of different sockets, but to no 
effect.

I use OpenMPI 1.4.1 on FD10 with vanilla TCP.  The install is totally standard 
with no changes.

Is this a known issue?

Any help appreciated.



Re: [OMPI users] srun and openmpi

2011-04-28 Thread Ralph Castain

On Apr 27, 2011, at 1:06 PM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>> 
>>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
 
 On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
 
> Was this ever committed to the OMPI src as something not having to be
> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
> does?
 
 Not that I know of - I don't think the PSM developers ever looked at it.

Thought about this some more and I believe I have a soln to the problem. Will 
try to commit something to the devel trunk by the end of the week.

Ralph


 
> 
> I'm having some trouble getting Slurm/OpenMPI to play nice with the
> setup of this key.  Namely, with slurm you cannot export variables
> from the --prolog of an srun, only from an --task-prolog,
> unfortunately, if you use a task-prolog each rank gets a different
> key, which doesn't work.
> 
> I'm also guessing that each unique mpirun needs its own PSM key, not
> one for the whole system, so I can't just make it a permanent
> parameter somewhere else.
> 
> Also, i recall reading somewhere that the --resv-ports parameter that
> OMPI uses from slurm to choose a list of ports to use for TCP comm's,
> tries to lock a port from the pool three times before giving up.
 
 Had to look back at the code - I think you misread this. I can find no 
 evidence in the code that we try to bind that port more than once.
>>> 
>>> Perhaps I misstated: I don't believe you're trying to bind to the same
>>> port twice during the same session.  I believe the code re-uses
>>> similar ports from session to session.  What I believe happens (but I
>>> could be totally wrong) is that the previous session releases the port, but
>>> Linux isn't quite done with it when the new session tries to bind to
>>> the port, in which case it tries three times and then fails the job.
>> 
>> Actually, I understood you correctly. I'm just saying that I find no 
>> evidence in the code that we try three times before giving up. What I see is 
>> a single attempt to bind the port - if it fails, then we abort. There is no 
>> parameter to control that behavior.
>> 
>> So if the OS hasn't released the port by the time a new job starts on that 
>> node, then it will indeed abort if the job was unfortunately given the same 
>> port reservation.
> 
> Oh, okay, sorry...
> 