On May 2, 2011, at 2:34 AM, jody wrote:

> Hi Ralph
> 
> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
> The results are interesting!
> 
> I wrote a small HelloMPI app which basically calls usleep for a pause
> of 5 seconds.
> 
> Now calling it as i did before, no MPI errors appear anymore, only the
> display problems:
>  jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
> 
> When i do the same call *with* the debug option, the xterm appears and
> shows the output of HelloMPI!
> I attach the output in ompidbg_1.txt (It also works if i call with
> '-np 4' and '--xterm 0,1,2,3'

Good!

> 
> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
> 
> If i use the hold-option, the xterm appears with the output of
> 'hostname' (cf. ompidbg_3.txt)
> The xterm opens after the line "launch complete for job..." has been
> written (line 59)

Okay, that's also expected. Like I said, without the "hold", the output is 
generated so quickly that the window just flashes at best. I've had similar 
experiences - hence the "hold" option.

> 
> I just found that everything works as expected if i use the
> '--leave-session-attached' option (without the debug options):
>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
> ./HelloMPI
> The xterms are also opened if i do not use the '!' hold option.

Okay, I can understand why. The --leave-session-attached option just tells 
mpirun to not daemonize the backend daemons - thus leaving the ssh session 
alive. The debug options do the same thing, but turn on all the debug output.

The problem is that if you don't leave the ssh session alive, then the xterm 
has no way back to your screen. By daemonizing, we sever that connection.

What I should do (and maybe used to do, but it got removed) is automatically 
turn "on" the leave-session-attached option if you give --xterm. I can enter 
that patch.
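
Until that patch is in, the workaround is to pass both options yourself. 
Here is a minimal sketch of a wrapper that always pairs --xterm with 
--leave-session-attached. The function name and its arguments are made up 
for illustration and are not part of Open MPI; the mpirun options are the 
ones from your working command line above. It only prints the command so 
you can inspect it before running it:

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print an mpirun command line
# that always pairs --xterm with --leave-session-attached.
#   $1 = number of processes, $2 = host, $3 = rank list for --xterm, $4 = program
xterm_mpirun_cmd() {
    printf 'mpirun -np %s -host %s -mca plm_rsh_agent "ssh -Y" --leave-session-attached --xterm %s %s\n' \
        "$1" "$2" "$3" "$4"
}

# Print the command for the four-rank HelloMPI case from this thread.
xterm_mpirun_cmd 4 squid_0 0,1,2,3 ./HelloMPI
```

To launch instead of just printing, the output can be passed to eval, e.g. 
eval "$(xterm_mpirun_cmd 4 squid_0 0,1,2,3 ./HelloMPI)".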

Note that this does limit the size of the launch to the number of ssh sessions 
the system allows you to have open at the same time. We default to a limit of 
128 nodes, which is likely adequate for an xterm-based debugging session. 
However, you can increase it using an mca param (see ompi_info) to as high as 
the system allows.

Thanks for helping debug this! I'll add you to the patch list so you can track 
it.

> 
> What does *not* work is
>  jody@aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
> plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
> 
> But then again, this call works (i.e. an xterm is opened) if all the
> debug-options are used (ompidbg_4.txt).
> Here the '--leave-session-attached' is necessary - without it, no xterm.
> 
>> From these results i would say that there is no basic mishandling of
> 'ssh', though i have no idea
> what internal differences the use of the '-leave-session-attached'
> option or the debug options make.
> 
> I hope these observations are helpful
>  Jody
> 
> 
> On Fri, Apr 29, 2011 at 12:08 AM, jody <jody....@gmail.com> wrote:
>> Hi Ralph
>> 
>> Thank you for your suggestions.
>> I'll be happy to help  you.
>> I'm not sure if i'll get around to this tomorrow,
>> but i certainly will do so on Monday.
>> 
>> Thanks
>>  Jody
>> 
>> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Hi Jody
>>> 
>>> I'm not sure when I'll get a chance to work on this - got a deadline to 
>>> meet. I do have a couple of suggestions, if you wouldn't mind helping debug 
>>> the problem?
>>> 
>>> It looks to me like the problem is that mpirun is crashing or terminating 
>>> early for some reason - hence the failures to send msgs to it, and the 
>>> "lifeline lost" error that leads to the termination of the daemon. If you 
>>> build a debug version of the code (i.e., --enable-debug on configure), you 
>>> can get a lot of debug info that traces the behavior.
>>> 
>>> If you could then run your program with
>>> 
>>>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>> 
>>> and send it to me, we'll see what ORTE thinks it is doing.
>>> 
>>> You could also take a look at the code for implementing the xterm option. 
>>> You'll find it in
>>> 
>>> orte/mca/odls/base/odls_base_default_fns.c
>>> 
>>> around line 1115. The xterm command syntax is defined in
>>> 
>>> orte/mca/odls/base/odls_base_open.c
>>> 
>>> around line 233 and following. Note that we use "xterm -T" as the cmd. 
>>> Perhaps you can spot an error in the way we treat xterm?
>>> 
>>> Also, remember that you have to specify that you want us to "hold" the 
>>> xterm window open even after the process terminates. If you don't specify 
>>> it, the window automatically closes upon completion of the process. So a 
>>> fast-running cmd like "hostname" might disappear so quickly that it causes 
>>> a race condition problem.
>>> 
>>> You might want to try a spinner application - i.e., output something and 
>>> then sit in a loop or sleep for some period of time. Or, use the "hold" 
>>> option to keep the window open - you designate "hold" by putting a '!' 
>>> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>> 
>>> 
>>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>> 
>>>> Hi
>>>> 
>>>> Unfortunately this does not solve my problem.
>>>> While i can do
>>>>  ssh -Y squid_0 xterm
>>>> and this will open an xterm on my machine (chefli),
>>>> i run into problems with the -xterm option of openmpi:
>>>> 
>>>>  jody@chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>>>> -Y" -host squid_0 --xterm 1 hostname
>>>>  squid_0
>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>> [sd = 8]
>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>> lifeline [[35219,0],0] lost
>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>> [sd = 8]
>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>> lifeline [[35219,0],0] lost
>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>> 
>>>> By the way when i look at the DISPLAY variable in the xterm window
>>>> opened via squid_0,
>>>> i also have the display variable "localhost:11.0"
>>>> 
>>>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>>>> the lines with the warnings about "xauth" and "untrusted X" do not
>>>> appear:
>>>> 
>>>>  jody@chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 
>>>> hostname
>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>>> generated
>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>  squid_0
>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>> [sd = 8]
>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>> lifeline [[34926,0],0] lost
>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>> [sd = 8]
>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>> lifeline [[34926,0],0] lost
>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>> 
>>>> 
>>>> I have doubts that the "-Y" is passed correctly:
>>>>   jody@triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>>>> -Y" -host squid_0 xterm
>>>>  xterm Xt error: Can't open display:
>>>>  xterm:  DISPLAY is not set
>>>>  xterm Xt error: Can't open display:
>>>>  xterm:  DISPLAY is not set
>>>> 
>>>> 
>>>> ---> as a matter of fact i noticed that the xterm option doesn't work 
>>>> locally:
>>>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>>>> prints everything onto the console.
>>>> 
>>>> Do you have any other suggestions i could try?
>>>> 
>>>> Thank You
>>>> Jody
>>>> 
>>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Should be able to just set
>>>>> 
>>>>> -mca plm_rsh_agent "ssh -Y"
>>>>> 
>>>>> on your cmd line, I believe
>>>>> 
>>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>> 
>>>>>> Hi Ralph
>>>>>> 
>>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>> 
>>>>>> Thank You
>>>>>>   Jody
>>>>>> 
>>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody....@gmail.com> wrote:
>>>>>>> Hi Ralph
>>>>>>> thank you for your suggestions. After some fiddling, i found that after 
>>>>>>> my
>>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>>> (X11Forwarding was set to 'no').
>>>>>>> 
>>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>>> and with 'ssh -X'
>>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>> 
>>>>>>> But the xterm option still doesn't work:
>>>>>>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>>> printenv | grep WORLD_RANK
>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>>>>>> generated
>>>>>>>  Warning: No xauth data; using fake authentication data for X11 
>>>>>>> forwarding.
>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>> [sd = 8]
>>>>>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>>> lifeline [[54132,0],0] lost
>>>>>>> 
>>>>>>> So it looks like the two processes from squid_0 can't open the display 
>>>>>>> this way,
>>>>>>> but one of them writes the output to the console...
>>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh 
>>>>>>> -Y' the
>>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>> 
>>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>> 
>>>>>>> Thank You
>>>>>>>  Jody
>>>>>>> 
>>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>>> Cygwin-specific in the answers:
>>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>> 
>>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>> 
>>>>>>>> Sorry Jody - I should have read your note more carefully to see that 
>>>>>>>> you
>>>>>>>> already tried -Y. :-(
>>>>>>>> Not sure what to suggest...
>>>>>>>> 
>>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>> 
>>>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>>>>>>> result:
>>>>>>>> 
>>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote 
>>>>>>>> server
>>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>> 
>>>>>>>> When doing something like:
>>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, 
>>>>>>>> but I
>>>>>>>> got an error message like:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>>>>>>> generated
>>>>>>>> Warning: No xauth data; using fake authentication data for X11 
>>>>>>>> forwarding.
>>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>>> [root@RHEL ~]#
>>>>>>>> and any X programs I ran would not display on my local system..
>>>>>>>> 
>>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>> 
>>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>> 
>>>>>>>> and that worked fine.
>>>>>>>> 
>>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>>> accommodate.
>>>>>>>> 
>>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>> 
>>>>>>>> Hi Ralph
>>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>> 
>>>>>>>> But i also noticed that it is to ssh into squid_0 and open a xterm 
>>>>>>>> there:
>>>>>>>> 
>>>>>>>>  jody@chefli ~/share/neander $ ssh -Y squid_0
>>>>>>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>>>  jody@squid_0 ~ $ xterm
>>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>>>  jody@squid_0 ~ $ xterm
>>>>>>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>>>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>>>  jody@squid_0 ~ $ xterm
>>>>>>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>  jody@squid_0 ~ $ exit
>>>>>>>>  logout
>>>>>>>> 
>>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>>> as with mpirun:
>>>>>>>> 
>>>>>>>>  jody@chefli ~/share/neander $ ssh -X squid_0
>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>>> generated
>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 
>>>>>>>> forwarding.
>>>>>>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>> 
>>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>> 
>>>>>>>> Thank You
>>>>>>>>  Jody
>>>>>>>> 
>>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <r...@open-mpi.org> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> If I read your error messages correctly, it looks like mpirun is 
>>>>>>>> crashing -
>>>>>>>> the daemon is complaining that it lost the socket connection back to 
>>>>>>>> mpirun,
>>>>>>>> and hence will abort.
>>>>>>>> 
>>>>>>>> Are you seeing mpirun still alive?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>> 
>>>>>>>> Hi
>>>>>>>> 
>>>>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>>> it works in "text-mode":
>>>>>>>> 
>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>  OMPI_COMM_WORLD_RANK=1
>>>>>>>>  OMPI_COMM_WORLD_RANK=2
>>>>>>>>  OMPI_COMM_WORLD_RANK=3
>>>>>>>> 
>>>>>>>> but when i use the -xterm option to mpirun, it doesn't work:
>>>>>>>> 
>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep
>>>>>>>> WORLD_RANK
>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>>> generated
>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 
>>>>>>>> forwarding.
>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>> [sd = 8]
>>>>>>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>>>>>>>> lifeline [[55607,0],0] lost
>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>> 
>>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>> 
>>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>>> the workstation,
>>>>>>>> to the IP address of the workstation or simply to ":0.0", it doesn't 
>>>>>>>> work.
>>>>>>>> 
>>>>>>>> But i do have xauth data (as far as i know):
>>>>>>>> 
>>>>>>>> On the remote (squid_0):
>>>>>>>> 
>>>>>>>>  jody@squid_0 ~ $ xauth list
>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>> 
>>>>>>>> on the workstation:
>>>>>>>> 
>>>>>>>>  $ xauth list
>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
>>>>>>>> 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  
>>>>>>>> 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>> 
>>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>>>> 
>>>>>>>> I have also done
>>>>>>>> 
>>>>>>>>   xhost + squid_0
>>>>>>>> 
>>>>>>>> on the workstation.
>>>>>>>> 
>>>>>>>> How can i get the -xterm option running?
>>>>>>>> 
>>>>>>>> Thank You
>>>>>>>>  Jody
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> 
>>>>>>>> users mailing list
>>>>>>>> 
>>>>>>>> us...@open-mpi.org
>>>>>>>> 
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> <ompidbg_1.txt><ompidbg_2.txt><ompidbg_3.txt><ompidbg_4.txt>

