Hi Gus,

Thank you for all your suggestions.

I fixed the limits as you suggested and ran the test, but I am
still getting the same failure.  More on that below; first, my
responses to the points you raised.

> the IP number you checked now is not the same as in your
> message with the MPI failure/errors.
> Not sure if I understand which computers we're talking about,
> or where these computers are (at Amazon?),
> or if they change depending on each session you use to run your programs,
> if they are identical machines with the same limits or if they differ.

Everything I mentioned in the last 2-3 days is on the Amazon EC2 cloud.  I
have no problem running the same thing locally (vixen is my local
machine):

  [tsakai@vixen Rmpi]$ cat app.ac1
  -H vixen   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5
  -H vixen   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6
  -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 7
  -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 8
  [tsakai@vixen Rmpi]$
  [tsakai@vixen Rmpi]$ mpirun --app app.ac1
  5 vixen.egcrc.org
  8 vixen.egcrc.org
  13 blitzen.egcrc.org
  21 blitzen.egcrc.org
  [tsakai@vixen Rmpi]$ # these lines are the correct result.
  [tsakai@vixen Rmpi]$

Amazon EC2, where the strange behavior happens, is a virtualized
environment.  They charge by the hour: I launch an instance of a
machine when I need it and I shut it down when I am done.  Each time
I get different IP addresses (2 per instance, one on the internal
network and the other for the public interface).  That is why I
don't show a consistent ip address or dns name.  Every time I shut
down the machine, what I did on that instance disappears and on the
next instance I have to recreate it from scratch --case in point is
~/.ssh/config--, which is what I have been doing (unless I take a
'snapshot' of the image and save it to persistent storage, and doing
a snapshot is a bit of work).
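For reference, the file I keep recreating is tiny.  A sketch of it
(the same shape as the Host * blocks that appear later in this
thread, with my key name filled in):

  # ~/.ssh/config, recreated on each fresh instance
  # BatchMode makes the ssh connections that mpirun opens fail fast
  # rather than stop at a password/passphrase prompt
  Host *
      IdentityFile /home/tsakai/.ssh/tsakai
      IdentitiesOnly yes
      BatchMode yes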

> One of the error messages mentions LD_LIBRARY_PATH.
> Is it set to point to the OpenMPI lib directory?
> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly
> set.

Yes, I have been setting LD_LIBRARY_PATH manually every time, because
I neglected to put it into my bash startup file as part of building
the AMI (Amazon Machine Image).
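The fix, when I next rebuild the AMI, is just a couple of lines in
the bash startup file.  A sketch of what I mean to add (assuming the
matching OpenMPI bin directory is /usr/local/bin, which I should
double-check):

  # for ~/.bashrc, so that both interactive logins and the
  # non-interactive shells spawned by ssh/mpirun pick these up
  export PATH=/usr/local/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH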

Now, here is what I have done: got onto an instance as tsakai, saved
the output of 'ulimit -a', set the /etc/security/limits.conf
parameters as you suggest, logged off and back onto the instance
(thereby activating those ulimit parameters), and ran the same
(actually simpler) test, as tsakai and as root.

  [tsakai@vixen Rmpi]$
  [tsakai@vixen Rmpi]$ # 2ec2 below is a script/wrapper around ssh to
  [tsakai@vixen Rmpi]$ # make the ssh invocation line shorter.
  [tsakai@vixen Rmpi]$
  [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com
  The authenticity of host 'ec2-50-16-55-64.compute-1.amazonaws.com (50.16.55.64)' can't be established.
  RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
  Are you sure you want to continue connecting (yes/no)? yes
  Last login: Tue Feb  8 22:52:54 2011 from 10.201.197.188
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ ulimit -a > mylimit.1
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ sudo su
  bash-3.2#
  bash-3.2# cat - >> /etc/security/limits.conf
  *   -   memlock     -1
  *   -   stack       -1
  *   -   nofile      4096
  bash-3.2#
  bash-3.2# tail /etc/security/limits.conf
  #@student        hard    nproc           20
  #@faculty        soft    nproc           20
  #@faculty        hard    nproc           50
  #ftp             hard    nproc           0
  #@student        -       maxlogins       4

  # End of file
  *   -   memlock     -1
  *   -   stack       -1
  *   -   nofile      4096
  bash-3.2#
  bash-3.2# exit
  exit
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ # logout and log back in to activate the
  [tsakai@ip-10-114-138-129 ~]$ # new setting.
  [tsakai@ip-10-114-138-129 ~]$ exit
  logout
  [tsakai@vixen ec2]$
  [tsakai@vixen ec2]$ # I am back on vixen and about to log back onto
  [tsakai@vixen ec2]$ # the instance, which is still running.
  [tsakai@vixen ec2]$
  [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com
  Last login: Fri Feb 11 23:50:47 2011 from 63.193.205.1
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ ulimit -a > mylimit.2
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ diff mylimit.1 mylimit.2
  6c6
  < max locked memory       (kbytes, -l) 32
  ---
  > max locked memory       (kbytes, -l) unlimited
  8c8
  < open files                      (-n) 1024
  ---
  > open files                      (-n) 4096
  12c12
  < stack size              (kbytes, -s) 8192
  ---
  > stack size              (kbytes, -s) unlimited
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ # yes, I have the same ulimit parameters as
  [tsakai@ip-10-114-138-129 ~]$ # Gus suggested.
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ export LD_LIBRARY_PATH=/usr/local/lib
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ env | grep LD_LIB
  LD_LIBRARY_PATH=/usr/local/lib
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ cat - > app.ac
  -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
  -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ cat app.ac
  -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
  -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ hostname
  ip-10-114-138-129
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ # this run doesn't involve another node;
  [tsakai@ip-10-114-138-129 ~]$ # just use this machine's cores.
  [tsakai@ip-10-114-138-129 ~]$ # there are 2 cores on this machine.
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ mpirun --app app.ac
  --------------------------------------------------------------------------
  mpirun was unable to launch the specified application as it encountered an error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-114-138-129

  while attempting to start process rank 0.
  --------------------------------------------------------------------------
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ # I still get the same error!
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ cat /proc/sys/fs/file-nr
  512     0       762674
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ # the number of open files (512) is nowhere
  [tsakai@ip-10-114-138-129 ~]$ # near the limit, which is 4096 now.
  [tsakai@ip-10-114-138-129 ~]$ # now let's run it as root.
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ sudo su
  bash-3.2#
  bash-3.2# env | grep LD_LIBR
  LD_LIBRARY_PATH=/usr/local/lib
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# mpirun --app ./app.ac
  5 ip-10-114-138-129
  8 ip-10-114-138-129
  bash-3.2#
  bash-3.2# # that's the correct result!
  bash-3.2#
  bash-3.2# cat /proc/sys/fs/file-nr
  512     0       762674
  bash-3.2#
  bash-3.2# # this shows that mpirun didn't leave any
  bash-3.2# # open files behind, I think.  That's good.
  bash-3.2#
  bash-3.2# exit
  exit
  [tsakai@ip-10-114-138-129 ~]$
  [tsakai@ip-10-114-138-129 ~]$ exit
  logout
  [tsakai@vixen ec2]$

Had it failed both as root and as user tsakai, I could conclude
that either the virtualized environment is disagreeable with
openmpi OR there is something wrong with what I am trying to do.
But what kills me is that it *does* work when run by root.  Why
the pipe system call fails for user tsakai and not for root is
something I don't understand.
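If it would help, I suppose one more stone to turn over is to catch
the failing call's errno directly, along these lines (assuming strace
is available on the AMI, which I have not checked):

  # run mpirun under strace, following child processes (-f) and
  # recording only pipe system calls; the errno on the failing
  # call should say why it fails for tsakai but not for root
  strace -f -e trace=pipe -o /tmp/mpirun.trace mpirun --app app.ac
  grep pipe /tmp/mpirun.trace | tail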

BTW, here is the same test (using a single machine) in my local
environment (i.e., not virtualized):

  [tsakai@vixen Rmpi]$ cat app.ac2
  -H vixen   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5
  -H vixen   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6
  [tsakai@vixen Rmpi]$
  [tsakai@vixen Rmpi]$ mpirun --app app.ac2
  5 vixen.egcrc.org
  8 vixen.egcrc.org
  [tsakai@vixen Rmpi]$

I am running out of stones to turn over for now, and maybe it's
a good time to go to bed.  :)

I would appreciate it if you could come up with different things
to try.

Many thanks for your help.

Regards,

Tena


On 2/11/11 7:45 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Hi Tena
>
> We setup the cluster nodes to run MPI programs
> with stacksize unlimited,
> memlock unlimited,
> 4096 max open files,
> to avoid crashing on edge cases.
> This is kind of typical for HPC, MPI, number crunching.
>
> However, some are quite big codes,
> and from what you said yours is not (or not yet).
>
> Your stack limit sounds quite small, but when
> we had problems with stack the result was a segmentation fault.
> 1024 files I guess is a default for 32 bit Linux distributions,
> but some programs break there.
>
> If you want to do this, put these lines on the bottom
> of /etc/security/limits.conf:
>
> # End of file
> *   -   memlock     -1
> *   -   stack       -1
> *   -   nofile      4096
>
> I don't think you should give unlimited number of processes to
> regular users; keep this privilege to root (which is where
> the two have different limits).
>
> You may want to monitor /proc/sys/fs/file-nr while the program runs.
> The first number is the actual number of open files.
> Top or vmstat also help see how you are doing in terms of memory,
> although you suggested these are (small?) test programs, unlikely to run
> out of memory.
>
> If you are using two nodes, check the same stuff on the other node too.
> Also, the IP number you checked now is not the same as in your
> message with the MPI failure/errors.
> Not sure if I understand which computers we're talking about,
> or where these computers are (at Amazon?),
> or if they change depending on each session you use to run your programs,
> if they are identical machines with the same limits or if they differ.
>
> One of the error messages mentions LD_LIBRARY_PATH.
> Is it set to point to the OpenMPI lib directory?
> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly set.
>
> I hope this helps, although I am afraid I may be missing the point.
>
> Gus Correa
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>> Thank you for your tips.
>>
>> I didn't find any smoking gun or anything that comes close.
>> Here's the upshot:
>>
>>   [tsakai@ip-10-114-239-188 ~]$ ulimit -a
>>   core file size          (blocks, -c) 0
>>   data seg size           (kbytes, -d) unlimited
>>   scheduling priority             (-e) 0
>>   file size               (blocks, -f) unlimited
>>   pending signals                 (-i) 61504
>>   max locked memory       (kbytes, -l) 32
>>   max memory size         (kbytes, -m) unlimited
>>   open files                      (-n) 1024
>>   pipe size            (512 bytes, -p) 8
>>   POSIX message queues     (bytes, -q) 819200
>>   real-time priority              (-r) 0
>>   stack size              (kbytes, -s) 8192
>>   cpu time               (seconds, -t) unlimited
>>   max user processes              (-u) 61504
>>   virtual memory          (kbytes, -v) unlimited
>>   file locks                      (-x) unlimited
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$ sudo su
>>   bash-3.2#
>>   bash-3.2# ulimit -a
>>   core file size          (blocks, -c) 0
>>   data seg size           (kbytes, -d) unlimited
>>   scheduling priority             (-e) 0
>>   file size               (blocks, -f) unlimited
>>   pending signals                 (-i) 61504
>>   max locked memory       (kbytes, -l) 32
>>   max memory size         (kbytes, -m) unlimited
>>   open files                      (-n) 1024
>>   pipe size            (512 bytes, -p) 8
>>   POSIX message queues     (bytes, -q) 819200
>>   real-time priority              (-r) 0
>>   stack size              (kbytes, -s) 8192
>>   cpu time               (seconds, -t) unlimited
>>   max user processes              (-u) unlimited
>>   virtual memory          (kbytes, -v) unlimited
>>   file locks                      (-x) unlimited
>>   bash-3.2#
>>   bash-3.2#
>>   bash-3.2# ulimit -a > root_ulimit-a
>>   bash-3.2# exit
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
>>   14c14
>>   < max user processes              (-u) unlimited
>>   ---
>>> max user processes              (-u) 61504
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr
>> /proc/sys/fs/file-max
>>   480     0       762674
>>   762674
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$ sudo su
>>   bash-3.2#
>>   bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>   512     0       762674
>>   762674
>>   bash-3.2# exit
>>   exit
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
>>   -bash: sysctl: command not found
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$ /sbin/!!
>>   /sbin/sysctl -a |grep fs.file-max
>>   error: permission denied on key 'kernel.cad_pid'
>>   error: permission denied on key 'kernel.cap-bound'
>>   fs.file-max = 762674
>>   [tsakai@ip-10-114-239-188 ~]$
>>   [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
>>   fs.file-max = 762674
>>   [tsakai@ip-10-114-239-188 ~]$
>>
>> I see a bit of difference between root and tsakai, but I cannot
>> believe such a small difference results in the sort of catastrophic
>> failure I have reported.  Would you agree with me?
>>
>> Regards,
>>
>> Tena
>>
>> On 2/11/11 6:06 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>
>>> Hi Tena
>>>
>>> Please read one answer inline.
>>>
>>> Tena Sakai wrote:
>>>> Hi Jeff,
>>>> Hi Gus,
>>>>
>>>> Thanks for your replies.
>>>>
>>>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>>>> as identical to that of root.  In that setting I reproduced the
>>>> same result as before: root can run mpirun correctly and tsakai
>>>> cannot.
>>>>
>>>> I have also checked out permission on /tmp directory.  tsakai has
>>>> no problem creating files under /tmp.
>>>>
>>>> I am trying to come up with a strategy to show that each and every
>>>> program in the PATH has "world" executable permission.  It is a
>>>> stone to turn over, but I am not holding my breath.
>>>>
>>>>> ... you are running out of file descriptors. Are file descriptors
>>>>> limited on a per-process basis, perchance?
>>>> I have never heard there is such a restriction on Amazon EC2.  There
>>>> are folks who keep running instances for a long, long time.  Whereas
>>>> in my case, I launch 2 instances, check things out, and then turn
>>>> the instances off.  (Given that the state of California has huge
>>>> debts, our funding is very tight.)  So, I really doubt that's the
>>>> case.  I have run mpirun unsuccessfully as user tsakai and immediately
>>>> after successfully as root.  Still, I would be happy if you can tell
>>>> me a way to tell the number of file descriptors used or remaining.
>>>>
>>>> Your mentioning file descriptors made me think of something under
>>>> /dev.  But I don't know exactly what I am fishing for.  Do you have
>>>> some suggestions?
>>>>
>>> 1) If the environment has anything to do with Linux,
>>> check:
>>>
>>> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>>
>>>
>>> or
>>>
>>> sysctl -a |grep fs.file-max
>>>
>>> This max can be set (fs.file-max=whatever_is_reasonable)
>>> in /etc/sysctl.conf
>>>
>>> See 'man sysctl' and 'man sysctl.conf'
>>>
>>> 2) Another possible source of limits.
>>>
>>> Check "ulimit -a" (bash) or "limit" (tcsh).
>>>
>>> If you need to change look at:
>>>
>>> /etc/security/limits.conf
>>>
>>> (See also 'man limits.conf')
>>>
>>> **
>>>
>>> Since "root can but Tena cannot",
>>> I would check 2) first,
>>> as they are the 'per user/per group' limits,
>>> whereas 1) is kernel/system-wise.
>>>
>>> I hope this helps,
>>> Gus Correa
>>>
>>> PS - I know you are a wise and careful programmer,
>>> but here we had cases of programs that would
>>> fail because of too many files that were open and never closed,
>>> eventually exceeding the max available/permissible.
>>> So, it does happen.
>>>
>>>> I wish I could reproduce this (weird) behavior on a different
>>>> set of machines.  I certainly cannot in my local environment.  Sigh!
>>>>
>>>> Regards,
>>>>
>>>> Tena
>>>>
>>>>
>>>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>
>>>>> It is concerning if the pipe system call fails - I can't think of why that
>>>>> would happen. That's not usually a permissions issue but rather a deeper
>>>>> indication that something is either seriously wrong on your system or you
>>>>> are running out of file descriptors. Are file descriptors limited on a
>>>>> per-process basis, perchance?
>>>>>
>>>>> Sent from my PDA. No type good.
>>>>>
>>>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>>>>
>>>>>> Hi Tena
>>>>>>
>>>>>> Since root can but you can't,
>>>>>> is it a directory permission problem perhaps?
>>>>>> Check the execution directory permission (on both machines,
>>>>>> if this is not NFS mounted dir).
>>>>>> I am not sure, but IIRR OpenMPI also uses /tmp for
>>>>>> under-the-hood stuff, worth checking permissions there also.
>>>>>> Just a naive guess.
>>>>>>
>>>>>> Congrats for all the progress with the cloudy MPI!
>>>>>>
>>>>>> Gus Correa
>>>>>>
>>>>>> Tena Sakai wrote:
>>>>>>> Hi,
>>>>>>> I have made a bit more progress.  I think I can say the ssh
>>>>>>> authentication problem is behind me now.  I am still having a problem running
>>>>>>> mpirun, but the latest discovery, which I can reproduce, is that
>>>>>>> I can run mpirun as root.  Here's the session log:
>>>>>>>  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>>>>>  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ ll
>>>>>>>  total 8
>>>>>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
>>>>>>>  total 16
>>>>>>>  -rw------- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
>>>>>>>  -rw------- 1 tsakai tsakai  102 Feb 11 00:34 config
>>>>>>>  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>>>>>  -rw------- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>>>>>  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$ hostname
>>>>>>>  ip-10-100-243-195
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$ ll
>>>>>>>  total 8
>>>>>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>>>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
>>>>>>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$
>>>>>>>  [tsakai@ip-10-100-243-195 ~]$ exit
>>>>>>>  logout
>>>>>>>  Connection to ip-10-100-243-195.ec2.internal closed.
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ hostname
>>>>>>>  ip-10-195-198-31
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>  mpirun was unable to launch the specified application as it
>>>>>>>  encountered an error:
>>>>>>>  Error: pipe function call failed when setting up I/O forwarding
>>>>>>>  subsystem
>>>>>>>  Node: ip-10-195-198-31
>>>>>>>  while attempting to start process rank 0.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ # try it as root
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ sudo su
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# pwd
>>>>>>>  /home/tsakai
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# ls -l /root/.ssh/config
>>>>>>>  -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# cat /root/.ssh/config
>>>>>>>  Host *
>>>>>>>          IdentityFile /root/.ssh/.derobee/.kagi
>>>>>>>          IdentitiesOnly yes
>>>>>>>          BatchMode yes
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# pwd
>>>>>>>  /home/tsakai
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# ls -l
>>>>>>>  total 8
>>>>>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# # now is the time for mpirun
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# mpirun --app ./app.ac
>>>>>>>  13 ip-10-100-243-195
>>>>>>>  21 ip-10-100-243-195
>>>>>>>  5 ip-10-195-198-31
>>>>>>>  8 ip-10-195-198-31
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# # It works (being root)!
>>>>>>>  bash-3.2#
>>>>>>>  bash-3.2# exit
>>>>>>>  exit
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>  mpirun was unable to launch the specified application as it
>>>>>>>  encountered an error:
>>>>>>>  Error: pipe function call failed when setting up I/O forwarding
>>>>>>>  subsystem
>>>>>>>  Node: ip-10-195-198-31
>>>>>>>  while attempting to start process rank 0.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ # I don't get it.
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$
>>>>>>>  [tsakai@ip-10-195-198-31 ~]$ exit
>>>>>>>  logout
>>>>>>>  [tsakai@vixen ec2]$
>>>>>>> So, why does it say "pipe function call failed when setting up
>>>>>>> I/O forwarding subsystem Node: ip-10-195-198-31" ?
>>>>>>> The node it is referring to is not the remote machine.  It is
>>>>>>> what I call machine A.  I first thought maybe this is a problem
>>>>>>> with the PATH variable.  But I don't think so.  I compared root's
>>>>>>> PATH to that of tsakai's, made them identical, and retried.
>>>>>>> I got the same behavior.
>>>>>>> If you could enlighten me as to why this is happening, I would
>>>>>>> really appreciate it.
>>>>>>> Thank you.
>>>>>>> Tena
>>>>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:
>>>>>>>> Hi jeff,
>>>>>>>>
>>>>>>>> Thanks for the firewall tip.  I tried it while allowing all TCP traffic
>>>>>>>> and got an interesting and perplexing result.  Here's what's interesting
>>>>>>>> (BTW, I got rid of "LogLevel DEBUG3" from my .ssh/config on this run):
>>>>>>>>
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>   Host key verification failed.
>>>>>>>>
>>>>>>>>
>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>   A daemon (pid 2743) died unexpectedly with status 255 while attempting
>>>>>>>>   to launch so we are aborting.
>>>>>>>>
>>>>>>>>   There may be more information reported by the environment (see above).
>>>>>>>>
>>>>>>>>   This may be because the daemon was unable to find all the needed shared
>>>>>>>>   libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>>>>   the location of the shared libraries on the remote nodes and this will
>>>>>>>>   automatically be forwarded to the remote nodes.
>>>>>>>>
>>>>>>>>
>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>   mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>   that caused that situation.
>>>>>>>>
>>>>>>>>
>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>   mpirun: clean termination accomplished
>>>>>>>>
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to /usr/local/lib
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ # I'd better do this on machine B as well
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>>>>>>>   Warning: Identity file tsakai not accessible: No such file or directory.
>>>>>>>>   Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>>>>>>>   [tsakai@ip-10-195-171-159 ~]$
>>>>>>>>   [tsakai@ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>   [tsakai@ip-10-195-171-159 ~]$
>>>>>>>>   [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB
>>>>>>>>   LD_LIBRARY_PATH=/usr/local/lib
>>>>>>>>   [tsakai@ip-10-195-171-159 ~]$
>>>>>>>>   [tsakai@ip-10-195-171-159 ~]$ # OK, now go back to machine A
>>>>>>>>   [tsakai@ip-10-195-171-159 ~]$ exit
>>>>>>>>   logout
>>>>>>>>   Connection to ip-10-195-171-159 closed.
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ hostname
>>>>>>>>   ip-10-203-21-132
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ # try mpirun again
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>   Host key verification failed.
>>>>>>>>
>>>>>>>>
>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>   A daemon (pid 2789) died unexpectedly with status 255 while attempting
>>>>>>>>   to launch so we are aborting.
>>>>>>>>
>>>>>>>>   There may be more information reported by the environment (see above).
>>>>>>>>
>>>>>>>>   This may be because the daemon was unable to find all the needed shared
>>>>>>>>   libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>>>>   the location of the shared libraries on the remote nodes and this will
>>>>>>>>   automatically be forwarded to the remote nodes.
>>>>>>>>
>>>>>>>>
>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>   mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>   that caused that situation.
>>>>>>>>
>>>>>>>>
>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>   mpirun: clean termination accomplished
>>>>>>>>
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in /usr/local/lib...
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>>>>>>>   total 16604
>>>>>>>>   lrwxrwxrwx 1 root root      16 Feb  8 23:06 libfuse.so -> libfuse.so.2.8.5
>>>>>>>>   lrwxrwxrwx 1 root root      16 Feb  8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
>>>>>>>>   lrwxrwxrwx 1 root root      25 Feb  8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
>>>>>>>>   lrwxrwxrwx 1 root root      25 Feb  8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
>>>>>>>>   lrwxrwxrwx 1 root root      15 Feb  8 23:06 libmpi.so -> libmpi.so.0.0.2
>>>>>>>>   lrwxrwxrwx 1 root root      15 Feb  8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
>>>>>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
>>>>>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
>>>>>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
>>>>>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
>>>>>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
>>>>>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
>>>>>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
>>>>>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
>>>>>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
>>>>>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-rte.so.0 -> libopen-rte.so.0.0.0
>>>>>>>>   lrwxrwxrwx 1 root root      26 Feb  8 23:06 libopenmpi_malloc.so -> libopenmpi_malloc.so.0.0.0
>>>>>>>>   lrwxrwxrwx 1 root root      26 Feb  8 23:06 libopenmpi_malloc.so.0 -> libopenmpi_malloc.so.0.0.0
>>>>>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libulockmgr.so -> libulockmgr.so.1.0.1
>>>>>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libulockmgr.so.1 -> libulockmgr.so.1.0.1
>>>>>>>>   lrwxrwxrwx 1 root root      16 Feb  8 23:06 libxml2.so -> libxml2.so.2.7.2
>>>>>>>>   lrwxrwxrwx 1 root root      16 Feb  8 23:06 libxml2.so.2 -> libxml2.so.2.7.2
>>>>>>>>   -rw-r--r-- 1 root root  385912 Jan 26 01:00 libvt.a
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused...
>>>>>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>>>>>
>>>>>>>> Do you know why it's complaining about shared libraries?
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Tena
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>>>>>>
>>>>>>>>> Your prior mails were about ssh issues, but this one sounds like you
>>>>>>>>> might have firewall issues.
>>>>>>>>>
>>>>>>>>> That is, the "orted" command attempts to open a TCP socket back to
>>>>>>>>> mpirun for various command and control reasons.  If it is blocked from
>>>>>>>>> doing so by a firewall, Open MPI won't run.  In general, you can either
>>>>>>>>> disable your firewall or you can setup a trust relationship for TCP
>>>>>>>>> connections within your cluster.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>>>>>>
>>>>>>>>>> Hi Reuti,
>>>>>>>>>>
>>>>>>>>>> Thanks for suggesting "LogLevel DEBUG3."  I did so and complete
>>>>>>>>>> session is captured in the attached file.
>>>>>>>>>>
>>>>>>>>>> What I did is very similar to what I have done before: verify
>>>>>>>>>> that ssh works and then run the mpirun command.  In my somewhat
>>>>>>>>>> lengthy session log, there are two responses from "LogLevel
>>>>>>>>>> DEBUG3."  First from an scp invocation and then from the mpirun
>>>>>>>>>> invocation.  They both say
>>>>>>>>>>   debug1: Authentication succeeded (publickey).
>>>>>>>>>>
>>>>>>>>>> From the mpirun invocation, I see a line:
>>>>>>>>>>   debug1: Sending command:  orted --daemonize -mca ess env -mca
>>>>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
>>>>>>>>>>   2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>>>>>> After that it hung and I control-C'd out of it, which
>>>>>>>>>> gave me more lines.  But the lines after
>>>>>>>>>>   debug1: Sending command:  orted bla bla bla
>>>>>>>>>> don't look good to me.  But, in truth, I have no idea what they
>>>>>>>>>> mean.
>>>>>>>>>>
>>>>>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Tena
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>>>>>>
>>>>>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>> No, my environment is entirely linux.  The path to my home
>>>>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>>>>>>> though it is an nfs mount from vixen (which is known to
>>>>>>>>>>>> itself as /home/tsakai).  For historical reasons, I have
>>>>>>>>>>>> chosen to give a symbolic link named /Users to vixen's /Home,
>>>>>>>>>>>> so that I can use consistent path for both vixen and blitzen.
>>>>>>>>>>> okay. Sometimes the protection of the home directory must be
>>>>>>>>>>> adjusted too, but as you can do it from the command line this
>>>>>>>>>>> shouldn't be an issue.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>> No, it is not a private cluster.  It is Amazon EC2.  When I
>>>>>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>>>>>> but to address one Amazon cluster node from the other I use
>>>>>>>>>>>> the nodes' private dns names: domU-12-31-39-07-35-21 and
>>>>>>>>>>>> domU-12-31-39-06-74-E2.  Both public and private dns names
>>>>>>>>>>>> change from one launch to another.  I am using passphraseless
>>>>>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>>>>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>>>>>>> Amazon node B back to A.  (Please see my initial post.  There
>>>>>>>>>>>> is a session dialogue for this.)  They all work without an
>>>>>>>>>>>> authentication dialogue, except a brief initial dialogue:
>>>>>>>>>>>>  The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>>>>>>>  can't be established.
>>>>>>>>>>>>   RSA key fingerprint is
>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>   Are you sure you want to continue connecting (yes/no)?
>>>>>>>>>>>> to which I say "yes."
>>>>>>>>>>>> But I am unclear on what you mean by "hostbased authentication".
>>>>>>>>>>>> Doesn't that mean with a password?  If so, it is not an option.
>>>>>>>>>>> No. It's convenient inside a private cluster as it won't fill each
>>>>>>>>>>> user's known_hosts file and you don't need to create any ssh-keys.
>>>>>>>>>>> But when the hostname changes every time it might also create new
>>>>>>>>>>> hostkeys. It uses hostkeys (private and public), this way it works
>>>>>>>>>>> for all users. Just for reference:
>>>>>>>>>>>
>>>>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>>>>>
>>>>>>>>>>> You could look into it later.
>>>>>>>>>>>
>>>>>>>>>>> ==
>>>>>>>>>>>
>>>>>>>>>>> - Can you try to use a command when connecting from A to B? E.g.
>>>>>>>>>>> `ssh domU-12-31-39-06-74-E2 ls`. Is this working too?
>>>>>>>>>>>
>>>>>>>>>>> - What about putting:
>>>>>>>>>>>
>>>>>>>>>>> LogLevel DEBUG3
>>>>>>>>>>>
>>>>>>>>>>> In your ~/.ssh/config. Maybe we can see what he's trying to
>>>>>>>>>>> negotiate before it fails in verbose mode.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -- Reuti
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Tena
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> your local machine is Linux like, but the execution hosts are Macs?
>>>>>>>>>>>> I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>
>>>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>>>>>
>>>>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>>>>> I made a config file in my .ssh directory on the cloud.  It looks
>>>>>>>>>>>> like:
>>>>>>>>>>>>  # machine A
>>>>>>>>>>>>  Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>>>>> This is just an abbreviation or nickname above. To use the specified
>>>>>>>>>>>> settings, it's necessary to specify exactly this name. When the
>>>>>>>>>>>> settings are the same anyway for all machines, you can use:
>>>>>>>>>>>>
>>>>>>>>>>>> Host *
>>>>>>>>>>>>  IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>  IdentitiesOnly yes
>>>>>>>>>>>>  BatchMode yes
>>>>>>>>>>>>
>>>>>>>>>>>> instead.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It would
>>>>>>>>>>>> also be an option to use hostbased authentication, which will avoid
>>>>>>>>>>>> setting any known_hosts file or passphraseless ssh-keys for each user.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  HostName domU-12-31-39-07-35-21
>>>>>>>>>>>>  BatchMode yes
>>>>>>>>>>>>  IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>  ChallengeResponseAuthentication no
>>>>>>>>>>>>  IdentitiesOnly yes
>>>>>>>>>>>>
>>>>>>>>>>>>  # machine B
>>>>>>>>>>>>  Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>>>>  HostName domU-12-31-39-06-74-E2
>>>>>>>>>>>>  BatchMode yes
>>>>>>>>>>>>  IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>  ChallengeResponseAuthentication no
>>>>>>>>>>>>  IdentitiesOnly yes
>>>>>>>>>>>>
>>>>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>>>>
>>>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>>>>  [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>>>
>>>>>>>>>>>> It hangs.  I control-C out of it and I get:
>>>>>>>>>>>>  mpirun: killing job...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  --------------------------------------------------------------------------
>>>>>>>>>>>>  mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>>  process that caused that situation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  --------------------------------------------------------------------------
>>>>>>>>>>>>  --------------------------------------------------------------------------
>>>>>>>>>>>>  mpirun was unable to cleanly terminate the daemons on the nodes
>>>>>>>>>>>>  shown below. Additional manual cleanup may be required - please
>>>>>>>>>>>>  refer to the "orte-clean" tool for assistance.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  --------------------------------------------------------------------------
>>>>>>>>>>>>      domU-12-31-39-07-35-21.compute-1.internal - daemon did not
>>>>>>>>>>>>      report back when launched
>>>>>>>>>>>>
>>>>>>>>>>>> Am I making progress?
>>>>>>>>>>>>
>>>>>>>>>>>> Does this mean I am past authentication and something else is the
>>>>>>>>>>>> problem?
>>>>>>>>>>>> Does someone have an example .ssh/config file I can look at?  There
>>>>>>>>>>>> are so many keyword-argument pairs for this config file and I would
>>>>>>>>>>>> like to look at some very basic one that works.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>> tsa...@gallo.ucsf.edu
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>>>  [tsakai@vixen local]$ cat app.ac1
>>>>>>>>>>>>  -H vixen.egcrc.org   -np 1 Rscript
>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>>>  -H vixen.egcrc.org   -np 1 Rscript
>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>>>  -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>>>  -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>>
>>>>>>>>>>>> The program I run is
>>>>>>>>>>>>  Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>>> where x is [5..8].  The machines vixen and blitzen each handle 2 runs.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>>>>  [tsakai@vixen local]$ cat fib.R
>>>>>>>>>>>>      # fib() computes, given index n, fibonacci number iteratively
>>>>>>>>>>>>      # here's the first dozen sequence (indexed from 0..11)
>>>>>>>>>>>>      # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>>>
>>>>>>>>>>>>  fib <- function( n ) {
>>>>>>>>>>>>          a <- 0
>>>>>>>>>>>>          b <- 1
>>>>>>>>>>>>          for ( i in 1:n ) {
>>>>>>>>>>>>               t <- b
>>>>>>>>>>>>               b <- a
>>>>>>>>>>>>               a <- a + t
>>>>>>>>>>>>          }
>>>>>>>>>>>>      a
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>>  arg <- commandArgs( TRUE )
>>>>>>>>>>>>  myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>>>  cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>>
>>>>>>>>>>>> It reads an argument from the command line and produces a fibonacci
>>>>>>>>>>>> number that corresponds to that index, followed by the machine name.
>>>>>>>>>>>> Pretty simple stuff.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's the run output:
>>>>>>>>>>>>  [tsakai@vixen local]$ mpirun -app app.ac1
>>>>>>>>>>>>  5 vixen.egcrc.org
>>>>>>>>>>>>  8 vixen.egcrc.org
>>>>>>>>>>>>  13 blitzen.egcrc.org
>>>>>>>>>>>>  21 blitzen.egcrc.org
>>>>>>>>>>>>
>>>>>>>>>>>> Which is exactly what I expect.  So far so good.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I want to run the same thing on the cloud.  I launch 2 instances
>>>>>>>>>>>> of the same virtual machine, which I get to by:
>>>>>>>>>>>>  [tsakai@vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns
>>>>>>>>>>>>
>>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without password authentication,
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>  domU-12-31-39-00-D1-F2
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>>  Last login: Wed Feb  9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>>>  domU-12-31-39-0C-C8-01
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A without using password
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>  The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' can't
>>>>>>>>>>>>  be established.
>>>>>>>>>>>>  RSA key fingerprint is
>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>  Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>>>  Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the
>>>>>>>>>>>>  list of known hosts.
>>>>>>>>>>>>  Last login: Wed Feb  9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>  domU-12-31-39-00-D1-F2
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>>>  logout
>>>>>>>>>>>>  Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>>>  logout
>>>>>>>>>>>>  Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>  domU-12-31-39-00-D1-F2
>>>>>>>>>>>>
>>>>>>>>>>>> As you can see, neither machine uses a password for authentication;
>>>>>>>>>>>> it uses public/private key pairs.  There is no problem (that I can
>>>>>>>>>>>> see) with ssh invocation from one machine to the other.  This is so
>>>>>>>>>>>> because I have a copy of the public key and a copy of the private
>>>>>>>>>>>> key on each instance.
>>>>>>>>>>>>
>>>>>>>>>>>> The app.ac file is identical, except the node names:
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>>>  -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>>>  -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>>>  -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>>>  -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>>>
>>>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>>>
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>>>  tsakai@domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>>>  Permission denied, please try again.
>>>>>>>>>>>>  tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  --------------------------------------------------------------------------
>>>>>>>>>>>>  mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>>  process that caused that situation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>>  mpirun: clean termination accomplished
>>>>>>>>>>>>
>>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>
>>>>>>>>>>>> Mpirun (or somebody else?) asks me for a password, which I don't have.
>>>>>>>>>>>> I end up typing control-C.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's my question:
>>>>>>>>>>>> How can I get past authentication by mpirun where there is no
>>>>>>>>>>>> password?
>>>>>>>>>>>>
>>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>> tsa...@gallo.ucsf.edu