Have you searched the email archive and/or the web for Open MPI and the Amazon cloud? Others have previously worked through many of these problems for that environment; it might be worth a look to see if someone has already solved this, or at least to find a contact point for someone who is already running in that environment.
IIRC, there are some unique problems with running on that platform. On Feb 12, 2011, at 12:38 AM, Tena Sakai wrote: > Hi Gus, > > Thank you for all your suggestions. > > I fixed the limits as you suggested and ran the test and > I am still getting the same failure. More on that in a > bit. But here is a bit of my response to what you mentioned. > >> the IP number you checked now is not the same as in your >> message with the MPI failure/errors. >> Not sure if I understand which computers we're talking about, >> or where these computers are (at Amazon?), >> or if they change depending on each session you use to run your programs, >> if they are identical machines with the same limits or if they differ. > > Everything I mentioned in the last 2-3 days is on the Amazon EC2 cloud. I > have no problem running the same thing locally (vixen is my local > machine): > > [tsakai@vixen Rmpi]$ cat app.ac1 > -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5 > -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6 > -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 7 > -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 8 > [tsakai@vixen Rmpi]$ > [tsakai@vixen Rmpi]$ mpirun --app app.ac1 > 5 vixen.egcrc.org > 8 vixen.egcrc.org > 13 blitzen.egcrc.org > 21 blitzen.egcrc.org > [tsakai@vixen Rmpi]$ # these lines are the correct result. > [tsakai@vixen Rmpi]$ > > Amazon EC2, where the strange behavior happens, is a virtualized > environment. They charge by the hour. I launch an instance of a machine > when I need it and shut it down when I am done. Each time I get > different IP addresses (2 per instance, one on the internal network and > the other for the public interface). That is why I don't show a consistent > IP address or DNS name.
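Since every launch hands out fresh addresses and a clean home directory, the per-launch setup can be scripted. A minimal sketch, assuming a key file and host pattern that are illustrative only (the options themselves mirror the ssh config shown later in this thread):

```shell
# Regenerate a minimal ssh client config on each fresh instance.
# ~/.ssh/tsakai and *.ec2.internal are assumed names, not from any
# particular launch; IdentitiesOnly/BatchMode match the thread's config.
conf="$HOME/.ssh/config.ec2"      # staged copy; rename into place when happy
mkdir -p "$HOME/.ssh"
cat > "$conf" <<'EOF'
Host *.ec2.internal
    IdentityFile ~/.ssh/tsakai
    IdentitiesOnly yes
    BatchMode yes
EOF
chmod 600 "$conf"
echo "wrote $conf"
```

Baking such a script into the AMI (or passing it as instance user data) avoids redoing the setup by hand every launch.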
Every time I shut down the machine, what I did on > that instance disappears, and on the next instance I have to recreate it > from scratch (a case in point is ~/home/.ssh/config), which is what > I have been doing (unless I take a 'snapshot' of the image and save it > to persistent storage, and taking a snapshot is a bit of work). > >> One of the error messages mentions LD_LIBRARY_PATH. >> Is it set to point to the OpenMPI lib directory? >> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly >> set. > > Yes, I have been setting LD_LIBRARY_PATH manually every time, because > I have neglected to put it into my bash startup file as part of the AMI > (Amazon Machine Image) building. > > Now what I have done is get onto an instance as tsakai, save the output > from 'ulimit -a', set the /etc/security/limits.conf parameters as you > suggest, get off and re-log onto the instance (thereby activating > those ulimit parameters), and run the same (actually simpler) test, > as tsakai and as root. > > [tsakai@vixen Rmpi]$ > [tsakai@vixen Rmpi]$ # 2ec2 below is a script/wrapper around ssh to > [tsakai@vixen Rmpi]$ # make the ssh invocation line shorter. > [tsakai@vixen Rmpi]$ > [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com > The authenticity of host 'ec2-50-16-55-64.compute-1.amazonaws.com > (50.16.55.64)' can't be established. > RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. > Are you sure you want to continue connecting (yes/no)?
yes > Last login: Tue Feb 8 22:52:54 2011 from 10.201.197.188 > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ ulimit -a > mylimit.1 > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ sudo su > bash-3.2# > bash-3.2# cat - >> /etc/security/limits.conf > * - memlock -1 > * - stack -1 > * - nofile 4096 > bash-3.2# > bash-3.2# tail /etc/security/limits.conf > #@student hard nproc 20 > #@faculty soft nproc 20 > #@faculty hard nproc 50 > #ftp hard nproc 0 > #@student - maxlogins 4 > > # End of file > * - memlock -1 > * - stack -1 > * - nofile 4096 > bash-3.2# > bash-3.2# exit > exit > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ # logout and log back in to activate the > [tsakai@ip-10-114-138-129 ~]$ # new setting. > [tsakai@ip-10-114-138-129 ~]$ exit > logout > [tsakai@vixen ec2]$ > [tsakai@vixen ec2]$ # I am back on vixen and about to relogging back onto > [tsakai@vixen ec2]$ # the instance which is still running. > [tsakai@vixen ec2]$ > [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com > Last login: Fri Feb 11 23:50:47 2011 from 63.193.205.1 > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ ulimit -a > mylimit.2 > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ diff mylimit.1 mylimit.2 > 6c6 > < max locked memory (kbytes, -l) 32 > --- >> max locked memory (kbytes, -l) unlimited > 8c8 > < open files (-n) 1024 > --- >> open files (-n) 4096 > 12c12 > < stack size (kbytes, -s) 8192 > --- >> stack size (kbytes, -s) unlimited > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ # yes, I have the same ulimit parameters as > [tsakai@ip-10-114-138-129 ~]$ # Gus suggested. 
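The re-login step above matters because pam_limits reads /etc/security/limits.conf only for fresh logins. The same verification Tena does with `diff` can be sketched as a quick script (the expected values are the ones suggested in the thread):

```shell
# Check that the new per-user limits took effect in this login shell:
# with the thread's limits.conf lines, expect nofile=4096 and
# stack/memlock "unlimited". An old shell keeps the old limits.
nofile=$(ulimit -n)
stack=$(ulimit -s)
memlock=$(ulimit -l)
printf 'nofile=%s stack=%s memlock=%s\n' "$nofile" "$stack" "$memlock"
```

If the values printed still match the pre-edit ones, the session predates the limits.conf change (or PAM is not applying it, e.g. for non-login shells).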
> [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ export LD_LIBRARY_PATH=/usr/local/lib > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ env | grep LD_LIB > LD_LIBRARY_PATH=/usr/local/lib > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ cat - > app.ac > -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 > -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ cat app.ac > -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 > -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ hostname > ip-10-114-138-129 > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ # this run doesn't involve other node. > [tsakai@ip-10-114-138-129 ~]$ # just use this machine's cores. > [tsakai@ip-10-114-138-129 ~]$ # there are 2 cores on this machine. > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ mpirun --app app.ac > -------------------------------------------------------------------------- > mpirun was unable to launch the specified application as it encountered an > error: > > Error: pipe function call failed when setting up I/O forwarding subsystem > Node: ip-10-114-138-129 > > while attempting to start process rank 0. > -------------------------------------------------------------------------- > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ # I still get the same error! > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ cat /proc/sys/fs/file-nr > 512 0 762674 > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ # number of open files (512) is no where > [tsakai@ip-10-114-138-129 ~]$ # close to the limit, which is 4096 now. > [tsakai@ip-10-114-138-129 ~]$ # now let's run it as root. 
> [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ sudo su > bash-3.2# > bash-3.2# env | grep LD_LIBR > LD_LIBRARY_PATH=/usr/local/lib > bash-3.2# > bash-3.2# pwd > /home/tsakai > bash-3.2# > bash-3.2# mpirun --app ./app.ac > 5 ip-10-114-138-129 > 8 ip-10-114-138-129 > bash-3.2# > bash-3.2# # that's the correct result! > bash-3.2# > bash-3.2# cat /proc/sys/fs/file-nr > 512 0 762674 > bash-3.2# > bash-3.2# # this shows that mpirun didn't leave any > bash-3.2# # open file behind, I think. That's good. > bash-3.2# > bash-3.2# exit > exit > [tsakai@ip-10-114-138-129 ~]$ > [tsakai@ip-10-114-138-129 ~]$ exit > logout > [tsakai@vixen ec2]$ > > Had it been the case that it failed both as root and as user > tsakai, I could conclude that either the virtualized environment > is disagreeable with openmpi OR there is something wrong with > what I am trying to do. But what kills me is that it *does* > work when run by root. Why the pipe system call fails for user > tsakai and not for root is something I don't understand. > > BTW, here is the same test (using a single machine) in my local > environment (i.e., no virtualized environment): > > [tsakai@vixen Rmpi]$ cat app.ac2 > -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5 > -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6 > [tsakai@vixen Rmpi]$ > [tsakai@vixen Rmpi]$ mpirun --app app.ac2 > 5 vixen.egcrc.org > 8 vixen.egcrc.org > [tsakai@vixen Rmpi]$ > > I am running out of stones to turn over for now and maybe it's > a good time to go to bed. :) > > I would appreciate it if you can come up with different things > to try. > > Many thanks for your help. > > Regards, > > Tena > > > On 2/11/11 7:45 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote: > >> Hi Tena >> >> We set up the cluster nodes to run MPI programs >> with stacksize unlimited, >> memlock unlimited, >> 4096 max open files, >> to avoid crashing on edge cases. >> This is kind of typical for HPC, MPI, number crunching.
>> >> However, some are quite big codes, >> and from what you said yours is not (or not yet). >> >> Your stack limit sounds quite small, but when >> we had problems with stack the result was a segmentation fault. >> 1024 files I guess is a default for 32 bit Linux distributions, >> but some programs break there. >> >> If you want to do this, put these lines on the bottom >> of /etc/security/limits.conf: >> >> # End of file >> * - memlock -1 >> * - stack -1 >> * - nofile 4096 >> >> I don't think you should give unlimited number of processes to >> regular users; keep this privilege to root (which is where >> the two have different limits). >> >> You may want to monitor /proc/sys/fs/file-nr while the program runs. >> The first number is the actual number of open files. >> Top or vmstat also help see how you are doing in terms of memory, >> although you suggested these are (small?) test programs, unlikely to run >> out of memory. >> >> If you are using two nodes, check the same stuff on the other node too. >> Also, the IP number you checked now is not the same as in your >> message with the MPI failure/errors. >> Not sure if I understand which computers we're talking about, >> or where these computers are (at Amazon?), >> or if they change depending on each session you use to run your programs, >> if they are identical machines with the same limits or if they differ. >> >> One of the error messages mentions LD_LIBRARY_PATH. >> Is it set to point to the OpenMPI lib directory? >> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly set. >> >> I hope this helps, although I am afraid I may be missing the point. >> >> Gus Correa >> >> Tena Sakai wrote: >>> Hi Gus, >>> >>> Thank you for your tips. >>> >>> I didn't find any smoking gun or anything comes close. 
>>> Here's the upshot: >>> >>> [tsakai@ip-10-114-239-188 ~]$ ulimit -a >>> core file size (blocks, -c) 0 >>> data seg size (kbytes, -d) unlimited >>> scheduling priority (-e) 0 >>> file size (blocks, -f) unlimited >>> pending signals (-i) 61504 >>> max locked memory (kbytes, -l) 32 >>> max memory size (kbytes, -m) unlimited >>> open files (-n) 1024 >>> pipe size (512 bytes, -p) 8 >>> POSIX message queues (bytes, -q) 819200 >>> real-time priority (-r) 0 >>> stack size (kbytes, -s) 8192 >>> cpu time (seconds, -t) unlimited >>> max user processes (-u) 61504 >>> virtual memory (kbytes, -v) unlimited >>> file locks (-x) unlimited >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ sudo su >>> bash-3.2# >>> bash-3.2# ulimit -a >>> core file size (blocks, -c) 0 >>> data seg size (kbytes, -d) unlimited >>> scheduling priority (-e) 0 >>> file size (blocks, -f) unlimited >>> pending signals (-i) 61504 >>> max locked memory (kbytes, -l) 32 >>> max memory size (kbytes, -m) unlimited >>> open files (-n) 1024 >>> pipe size (512 bytes, -p) 8 >>> POSIX message queues (bytes, -q) 819200 >>> real-time priority (-r) 0 >>> stack size (kbytes, -s) 8192 >>> cpu time (seconds, -t) unlimited >>> max user processes (-u) unlimited >>> virtual memory (kbytes, -v) unlimited >>> file locks (-x) unlimited >>> bash-3.2# >>> bash-3.2# >>> bash-3.2# ulimit -a > root_ulimit-a >>> bash-3.2# exit >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a >>> 14c14 >>> < max user processes (-u) unlimited >>> --- >>>> max user processes (-u) 61504 >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr >>> /proc/sys/fs/file-max >>> 480 0 762674 >>> 762674 >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ sudo su >>> bash-3.2# >>> bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max >>> 512 0 
762674 >>> 762674 >>> bash-3.2# exit >>> exit >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max >>> -bash: sysctl: command not found >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ /sbin/!! >>> /sbin/sysctl -a |grep fs.file-max >>> error: permission denied on key 'kernel.cad_pid' >>> error: permission denied on key 'kernel.cap-bound' >>> fs.file-max = 762674 >>> [tsakai@ip-10-114-239-188 ~]$ >>> [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max >>> fs.file-max = 762674 >>> [tsakai@ip-10-114-239-188 ~]$ >>> >>> I see a bit of difference between root and tsakai, but I cannot >>> believe such a small difference results in the kind of catastrophic >>> failure I have reported. Would you agree with me? >>> >>> Regards, >>> >>> Tena >>> >>> On 2/11/11 6:06 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote: >>> >>>> Hi Tena >>>> >>>> Please read one answer inline. >>>> >>>> Tena Sakai wrote: >>>>> Hi Jeff, >>>>> Hi Gus, >>>>> >>>>> Thanks for your replies. >>>>> >>>>> I have pretty much ruled out PATH issues by setting tsakai's PATH >>>>> as identical to that of root. In that setting I reproduced the >>>>> same result as before: root can run mpirun correctly and tsakai >>>>> cannot. >>>>> >>>>> I have also checked the permissions on the /tmp directory. tsakai has >>>>> no problem creating files under /tmp. >>>>> >>>>> I am trying to come up with a strategy to show that each and every >>>>> program in the PATH has "world" executable permission. It is a >>>>> stone to turn over, but I am not holding my breath. >>>>> >>>>>> ... you are running out of file descriptors. Are file descriptors >>>>>> limited on a per-process basis, perchance? >>>>> I have never heard that there is such a restriction on Amazon EC2. There >>>>> are folks who keep running instances for a long, long time.
Whereas >>>>> in my case, I launch 2 instances, check things out, and then turn >>>>> the instances off. (Given that the state of California has huge >>>>> debts, our funding is very tight.) So, I really doubt that's the >>>>> case. I have run mpirun unsuccessfully as user tsakai and immediately >>>>> after successfully as root. Still, I would be happy if you can tell >>>>> me a way to tell the number of file descriptors used or remaining. >>>>> >>>>> Your mention of file descriptors made me think of something under >>>>> /dev. But I don't know exactly what I am fishing for. Do you have >>>>> some suggestions? >>>>> >>>> 1) If the environment has anything to do with Linux, >>>> check: >>>> >>>> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max >>>> >>>> >>>> or >>>> >>>> sysctl -a |grep fs.file-max >>>> >>>> This max can be set (fs.file-max=whatever_is_reasonable) >>>> in /etc/sysctl.conf >>>> >>>> See 'man sysctl' and 'man sysctl.conf' >>>> >>>> 2) Another possible source of limits. >>>> >>>> Check "ulimit -a" (bash) or "limit" (tcsh). >>>> >>>> If you need to change them, look at: >>>> >>>> /etc/security/limits.conf >>>> >>>> (See also 'man limits.conf') >>>> >>>> ** >>>> >>>> Since "root can but Tena cannot", >>>> I would check 2) first, >>>> as they are the 'per user/per group' limits, >>>> whereas 1) is kernel/system-wide. >>>> >>>> I hope this helps, >>>> Gus Correa >>>> >>>> PS - I know you are a wise and careful programmer, >>>> but here we had cases of programs that would >>>> fail because of too many files that were open and never closed, >>>> eventually exceeding the max available/permissible. >>>> So, it does happen. >>>> >>>>> I wish I could reproduce this (weird) behavior on a different >>>>> set of machines. I certainly cannot in my local environment. Sigh!
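Gus's check 1) can be scripted directly from /proc; a small sketch reading the kernel-wide file table and its ceiling (the first field of file-nr is the number of descriptors currently allocated):

```shell
# Kernel-wide descriptor accounting, per 'man 5 proc':
# file-nr holds three fields: allocated, unused, and the fs.file-max ceiling.
read allocated unused fmax < /proc/sys/fs/file-nr
echo "allocated=$allocated unused=$unused ceiling=$fmax"
# The ceiling can be raised system-wide as root, e.g.:
#   /sbin/sysctl -w fs.file-max=<new value>
# and persisted in /etc/sysctl.conf (see 'man sysctl.conf').
```

Note this is the system-wide picture; the per-user limits from check 2) are a separate ceiling and, given that root works while tsakai does not, the more likely suspect here.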
>>>>> >>>>> Regards, >>>>> >>>>> Tena >>>>> >>>>> >>>>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote: >>>>> >>>>>> It is concerning if the pipe system call fails; I can't think of why >>>>>> that >>>>>> would happen. That's not usually a permissions issue but rather a deeper >>>>>> indication that something is either seriously wrong on your system or you >>>>>> are >>>>>> running out of file descriptors. Are file descriptors limited on a >>>>>> per-process >>>>>> basis, perchance? >>>>>> >>>>>> Sent from my PDA. No type good. >>>>>> >>>>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu> >>>>>> wrote: >>>>>> >>>>>>> Hi Tena >>>>>>> >>>>>>> Since root can but you can't, >>>>>>> is it a directory permission problem, perhaps? >>>>>>> Check the execution directory permissions (on both machines, >>>>>>> if this is not an NFS-mounted dir). >>>>>>> I am not sure, but IIRR OpenMPI also uses /tmp for >>>>>>> under-the-hood stuff, so it is worth checking permissions there also. >>>>>>> Just a naive guess. >>>>>>> >>>>>>> Congrats for all the progress with the cloudy MPI! >>>>>>> >>>>>>> Gus Correa >>>>>>> >>>>>>> Tena Sakai wrote: >>>>>>>> Hi, >>>>>>>> I have made a bit more progress. I think I can say the ssh >>>>>>>> authentication problem is behind me now. I am still having a problem running >>>>>>>> mpirun, but the latest discovery, which I can reproduce, is that >>>>>>>> I can run mpirun as root.
Here's the session log: >>>>>>>> [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com >>>>>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195 >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ll >>>>>>>> total 8 >>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ll .ssh >>>>>>>> total 16 >>>>>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys >>>>>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config >>>>>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts >>>>>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal >>>>>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31 >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ # I am on machine B >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ hostname >>>>>>>> ip-10-100-243-195 >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ ll >>>>>>>> total 8 >>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac >>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ cat app.ac >>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 >>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 >>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7 >>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8 >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ # go back to machine A >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>> [tsakai@ip-10-100-243-195 ~]$ exit >>>>>>>> logout >>>>>>>> Connection to ip-10-100-243-195.ec2.internal 
closed. >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ hostname >>>>>>>> ip-10-195-198-31 >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac >>>>>>>> >>>>>>>> ------------------------------------------------------------------------ >>>>>>>> -- >>>>>>>> mpirun was unable to launch the specified application as it encountered >>>>>>>> an >>>>>>>> error: >>>>>>>> Error: pipe function call failed when setting up I/O forwarding >>>>>>>> subsystem >>>>>>>> Node: ip-10-195-198-31 >>>>>>>> while attempting to start process rank 0. >>>>>>>> >>>>>>>> ------------------------------------------------------------------------ >>>>>>>> -- >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # try it as root >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ sudo su >>>>>>>> bash-3.2# >>>>>>>> bash-3.2# pwd >>>>>>>> /home/tsakai >>>>>>>> bash-3.2# >>>>>>>> bash-3.2# ls -l /root/.ssh/config >>>>>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config >>>>>>>> bash-3.2# >>>>>>>> bash-3.2# cat /root/.ssh/config >>>>>>>> Host * >>>>>>>> IdentityFile /root/.ssh/.derobee/.kagi >>>>>>>> IdentitiesOnly yes >>>>>>>> BatchMode yes >>>>>>>> bash-3.2# >>>>>>>> bash-3.2# pwd >>>>>>>> /home/tsakai >>>>>>>> bash-3.2# >>>>>>>> bash-3.2# ls -l >>>>>>>> total 8 >>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >>>>>>>> bash-3.2# >>>>>>>> bash-3.2# # now is the time for mpirun >>>>>>>> bash-3.2# >>>>>>>> bash-3.2# mpirun --app ./app.ac >>>>>>>> 13 ip-10-100-243-195 >>>>>>>> 21 ip-10-100-243-195 >>>>>>>> 5 ip-10-195-198-31 >>>>>>>> 8 ip-10-195-198-31 >>>>>>>> bash-3.2# >>>>>>>> bash-3.2# # It works (being root)! 
>>>>>>>> bash-3.2# >>>>>>>> bash-3.2# exit >>>>>>>> exit >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac >>>>>>>> >>>>>>>> ------------------------------------------------------------------------ >>>>>>>> -- >>>>>>>> mpirun was unable to launch the specified application as it encountered >>>>>>>> an >>>>>>>> error: >>>>>>>> Error: pipe function call failed when setting up I/O forwarding >>>>>>>> subsystem >>>>>>>> Node: ip-10-195-198-31 >>>>>>>> while attempting to start process rank 0. >>>>>>>> >>>>>>>> ------------------------------------------------------------------------ >>>>>>>> -- >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # I don't get it. >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>> [tsakai@ip-10-195-198-31 ~]$ exit >>>>>>>> logout >>>>>>>> [tsakai@vixen ec2]$ >>>>>>>> So, why does it say "pipe function call failed when setting up >>>>>>>> I/O forwarding subsystem Node: ip-10-195-198-31" ? >>>>>>>> The node it is referring to is not the remote machine. It is >>>>>>>> what I call machine A. I first thought maybe this was a problem >>>>>>>> with the PATH variable. But I don't think so. I compared root's >>>>>>>> PATH to tsakai's, made them identical, and retried. >>>>>>>> I got the same behavior. >>>>>>>> If you could enlighten me as to why this is happening, I would really >>>>>>>> appreciate it. >>>>>>>> Thank you. >>>>>>>> Tena >>>>>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>>>>>>>> Hi Jeff, >>>>>>>>> >>>>>>>>> Thanks for the firewall tip. I tried it while allowing all TCP >>>>>>>>> traffic >>>>>>>>> and got an interesting and perplexing result.
Here's what's interesting >>>>>>>>> (BTW, I got rid of "LogLevel DEBUG3" from ./ssh/config on this run): >>>>>>>>> >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>>>>>>>> Host key verification failed. >>>>>>>>> >>>>>>>>> >>> ------------------------------------------------------------------------->>>> >>>>> >>> - >>>>>>>>> A daemon (pid 2743) died unexpectedly with status 255 while >>>>>>>>> attempting >>>>>>>>> to launch so we are aborting. >>>>>>>>> >>>>>>>>> There may be more information reported by the environment (see >>>>>>>>> above). >>>>>>>>> >>>>>>>>> This may be because the daemon was unable to find all the needed >>>>>>>>> shared >>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to >>>>>>>>> have >>>>>>>>> the >>>>>>>>> location of the shared libraries on the remote nodes and this will >>>>>>>>> automatically be forwarded to the remote nodes. >>>>>>>>> >>>>>>>>> >>> ------------------------------------------------------------------------->>>> >>>>> >>> - >>>>>>>>> >>> ------------------------------------------------------------------------->>>> >>>>> >>> - >>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>> process >>>>>>>>> that caused that situation. 
>>>>>>>>> >>>>>>>>> >>> ------------------------------------------------------------------------->>>> >>>>> >>> - >>>>>>>>> mpirun: clean termination accomplished >>>>>>>>> >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to >>>>>>>>> /usr/local/lib >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib' >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # I better to this on machine B as well >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159 >>>>>>>>> Warning: Identity file tsakai not accessible: No such file or >>>>>>>>> directory. >>>>>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132 >>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib' >>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB >>>>>>>>> LD_LIBRARY_PATH=/usr/local/lib >>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ # OK, now go bak to machine A >>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ exit >>>>>>>>> logout >>>>>>>>> Connection to ip-10-195-171-159 closed. >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ hostname >>>>>>>>> ip-10-203-21-132 >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # try mpirun again >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>>>>>>>> Host key verification failed. >>>>>>>>> >>>>>>>>> >>> ------------------------------------------------------------------------->>>> >>>>> >>> - >>>>>>>>> A daemon (pid 2789) died unexpectedly with status 255 while >>>>>>>>> attempting >>>>>>>>> to launch so we are aborting. 
>>>>>>>>> >>>>>>>>> There may be more information reported by the environment (see >>>>>>>>> above). >>>>>>>>> >>>>>>>>> This may be because the daemon was unable to find all the needed >>>>>>>>> shared >>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to >>>>>>>>> have >>>>>>>>> the >>>>>>>>> location of the shared libraries on the remote nodes and this will >>>>>>>>> automatically be forwarded to the remote nodes. >>>>>>>>> >>>>>>>>> >>> ------------------------------------------------------------------------->>>> >>>>> >>> - >>>>>>>>> >>> ------------------------------------------------------------------------->>>> >>>>> >>> - >>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>> process >>>>>>>>> that caused that situation. >>>>>>>>> >>>>>>>>> >>> ------------------------------------------------------------------------->>>> >>>>> >>> - >>>>>>>>> mpirun: clean termination accomplished >>>>>>>>> >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in >>>>>>>>> /usr/local/lib... 
>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less >>>>>>>>> total 16604 >>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> >>>>>>>>> libfuse.so.2.8.5 >>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> >>>>>>>>> libfuse.so.2.8.5 >>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> >>>>>>>>> libmca_common_sm.so.1.0.0 >>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> >>>>>>>>> libmca_common_sm.so.1.0.0 >>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> >>>>>>>>> libmpi.so.0.0.2 >>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> >>>>>>>>> libmpi.so.0.0.2 >>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> >>>>>>>>> libmpi_cxx.so.0.0.1 >>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> >>>>>>>>> libmpi_cxx.so.0.0.1 >>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> >>>>>>>>> libmpi_f77.so.0.0.1 >>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> >>>>>>>>> libmpi_f77.so.0.0.1 >>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> >>>>>>>>> libmpi_f90.so.0.0.1 >>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> >>>>>>>>> libmpi_f90.so.0.0.1 >>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> >>>>>>>>> libopen-pal.so.0.0.0 >>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> >>>>>>>>> libopen-pal.so.0.0.0 >>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> >>>>>>>>> libopen-rte.so.0.0.0 >>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> >>>>>>>>> libopen-rte.so.0.0.0 >>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> >>>>>>>>> libopenmpi_malloc.so.0.0.0 >>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 -> >>>>>>>>> libopenmpi_malloc.so.0.0.0 >>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> >>>>>>>>> libulockmgr.so.1.0.1 >>>>>>>>> 
lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> >>>>>>>>> libulockmgr.so.1.0.1 >>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> >>>>>>>>> libxml2.so.2.7.2 >>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> >>>>>>>>> libxml2.so.2.7.2 >>>>>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused... >>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>> >>>>>>>>> Do you know why it's complaining about shared libraries? >>>>>>>>> >>>>>>>>> Thank you. >>>>>>>>> >>>>>>>>> Tena >>>>>>>>> >>>>>>>>> >>>>>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote: >>>>>>>>> >>>>>>>>>> Your prior mails were about ssh issues, but this one sounds like you >>>>>>>>>> might >>>>>>>>>> have firewall issues. >>>>>>>>>> >>>>>>>>>> That is, the "orted" command attempts to open a TCP socket back to >>>>>>>>>> mpirun >>>>>>>>>> for >>>>>>>>>> various command and control reasons. If it is blocked from doing so >>>>>>>>>> by >>>>>>>>>> a >>>>>>>>>> firewall, Open MPI won't run. In general, you can either disable >>>>>>>>>> your >>>>>>>>>> firewall or you can setup a trust relationship for TCP connections >>>>>>>>>> within >>>>>>>>>> your >>>>>>>>>> cluster. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote: >>>>>>>>>> >>>>>>>>>>> Hi Reuti, >>>>>>>>>>> >>>>>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and complete >>>>>>>>>>> session is captured in the attached file. >>>>>>>>>>> >>>>>>>>>>> What I did is much similar to what I have done before: verify >>>>>>>>>>> that ssh works and then run mpirun command. In my a bit lengthy >>>>>>>>>>> session log, there are two responses from "LogLevel DEBUG3." First >>>>>>>>>>> from an scp invocation and then from mpirun invocation. They both >>>>>>>>>>> say >>>>>>>>>>> debug1: Authentication succeeded (publickey). 
>>>>>>>>>>> >>>>>>>>>>>> From mpirun invocation, I see a line: >>>>>>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca >>>>>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca >>>>>>>>>>> orte_ess_num_procs >>>>>>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256" >>>>>>>>>>> The IP address at the end of the line is indeed that of machine B. >>>>>>>>>>> After that there was hanging and I controlled-C out of it, which >>>>>>>>>>> gave me more lines. But the lines after >>>>>>>>>>> debug1: Sending command: orted bla bla bla >>>>>>>>>>> don't look good to me. But, in truth, I have no idea what they >>>>>>>>>>> mean. >>>>>>>>>>> >>>>>>>>>>> If you could shed some light, I would appreciate it very much. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> >>>>>>>>>>> Tena >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> On 10.02.2011 at 19:11, Tena Sakai wrote: >>>>>>>>>>>> >>>>>>>>>>>>> your local machine is Linux-like, but the execution hosts >>>>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output. >>>>>>>>>>>>> No, my environment is entirely Linux. The path to my home >>>>>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai, >>>>>>>>>>>>> although it is an NFS mount from vixen (which is known to >>>>>>>>>>>>> itself as /home/tsakai). For historical reasons, I have >>>>>>>>>>>>> chosen to give a symbolic link named /Users to vixen's /home, >>>>>>>>>>>>> so that I can use a consistent path for both vixen and blitzen. >>>>>>>>>>>> okay. Sometimes the protection of the home directory must be >>>>>>>>>>>> adjusted too, but >>>>>>>>>>>> as you can do it from the command line this shouldn't be an issue. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>>>>>> It would also be an option to use hostbased authentication, >>>>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless >>>>>>>>>>>>> ssh-keys for each user. >>>>>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I >>>>>>>>>>>>> ssh from my local machine (vixen) I use its public interface, >>>>>>>>>>>>> but to address from one Amazon cluster node to the other I >>>>>>>>>>>>> use the nodes' private DNS names: domU-12-31-39-07-35-21 and >>>>>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private DNS names >>>>>>>>>>>>> change from one launch to another. I am using passphraseless >>>>>>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to >>>>>>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from >>>>>>>>>>>>> Amazon node B back to A. (Please see my initial post. There >>>>>>>>>>>>> is a session dialogue for this.) They all work without an >>>>>>>>>>>>> authentication dialogue, except a brief initial dialogue: >>>>>>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)' >>>>>>>>>>>>> can't be established. >>>>>>>>>>>>> RSA key fingerprint is >>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? >>>>>>>>>>>>> to which I say "yes." >>>>>>>>>>>>> But I am unclear on what you mean by "hostbased authentication". >>>>>>>>>>>>> Doesn't that mean with a password? If so, it is not an option. >>>>>>>>>>>> No. It's convenient inside a private cluster as it won't fill each >>>>>>>>>>>> user's >>>>>>>>>>>> known_hosts file and you don't need to create any ssh-keys. But >>>>>>>>>>>> when the >>>>>>>>>>>> hostname changes every time it might also create new hostkeys. It >>>>>>>>>>>> uses >>>>>>>>>>>> hostkeys (private and public), this way it works for all users.
>>>>>>>>>>>> Just for >>>>>>>>>>>> reference: >>>>>>>>>>>> >>>>>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html >>>>>>>>>>>> >>>>>>>>>>>> You could look into it later. >>>>>>>>>>>> >>>>>>>>>>>> == >>>>>>>>>>>> >>>>>>>>>>>> - Can you try to use a command when connecting from A to B? E.g.: >>>>>>>>>>>> ssh domU-12-31-39-06-74-E2 ls >>>>>>>>>>>> Is this working too? >>>>>>>>>>>> >>>>>>>>>>>> - What about putting: >>>>>>>>>>>> >>>>>>>>>>>> LogLevel DEBUG3 >>>>>>>>>>>> >>>>>>>>>>>> in your ~/.ssh/config? Maybe we can see what it's trying to >>>>>>>>>>>> negotiate before >>>>>>>>>>>> it fails in verbose mode. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- Reuti >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> Regards, >>>>>>>>>>>>> >>>>>>>>>>>>> Tena >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> your local machine is Linux-like, but the execution hosts are >>>>>>>>>>>>> Macs? I saw the >>>>>>>>>>>>> /Users/tsakai/... in your output. >>>>>>>>>>>>> >>>>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh >>>>>>>>>>>>> domU-12-31-39-07-35-21 ls >>>>>>>>>>>>> >>>>>>>>>>>>> On 10.02.2011 at 07:08, Tena Sakai wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> I have made a bit of progress(?)... >>>>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks >>>>>>>>>>>>> like: >>>>>>>>>>>>> # machine A >>>>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal >>>>>>>>>>>>> This is just an abbreviation or nickname above. To use the >>>>>>>>>>>>> specified settings, >>>>>>>>>>>>> it's necessary to specify exactly this name.
When the settings are the same >>>>>>>>>>>>> anyway for all machines, you can use: >>>>>>>>>>>>> >>>>>>>>>>>>> Host * >>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>>>> BatchMode yes >>>>>>>>>>>>> >>>>>>>>>>>>> instead. >>>>>>>>>>>>> >>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It >>>>>>>>>>>>> would also be >>>>>>>>>>>>> an option to use hostbased authentication, which will avoid >>>>>>>>>>>>> setting any >>>>>>>>>>>>> known_hosts file or passphraseless ssh-keys for each user. >>>>>>>>>>>>> >>>>>>>>>>>>> -- Reuti >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> HostName domU-12-31-39-07-35-21 >>>>>>>>>>>>> BatchMode yes >>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>>>> ChallengeResponseAuthentication no >>>>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>>>> >>>>>>>>>>>>> # machine B >>>>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal >>>>>>>>>>>>> HostName domU-12-31-39-06-74-E2 >>>>>>>>>>>>> BatchMode yes >>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>>>> ChallengeResponseAuthentication no >>>>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>>>> >>>>>>>>>>>>> This file exists on both machine A and machine B. >>>>>>>>>>>>> >>>>>>>>>>>>> Now when I issue the mpirun command as below: >>>>>>>>>>>>> [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2 >>>>>>>>>>>>> >>>>>>>>>>>>> It hangs. I control-C out of it and I get: >>>>>>>>>>>>> mpirun: killing job... >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>>>>> process >>>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>>> >>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes >>>>>>>>>>>>> shown >>>>>>>>>>>>> below. Additional manual cleanup may be required - please refer to >>>>>>>>>>>>> the "orte-clean" tool for assistance. >>>>>>>>>>>>> >>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not >>>>>>>>>>>>> report >>>>>>>>>>>>> back when launched >>>>>>>>>>>>> >>>>>>>>>>>>> Am I making progress? >>>>>>>>>>>>> >>>>>>>>>>>>> Does this mean I am past authentication and something else is the >>>>>>>>>>>>> problem? >>>>>>>>>>>>> Does someone have an example .ssh/config file I can look at? >>>>>>>>>>>>> There are so >>>>>>>>>>>>> many keyword-argument pairs for this config file and I would like >>>>>>>>>>>>> to look at >>>>>>>>>>>>> a very basic one that works. >>>>>>>>>>>>> >>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>> >>>>>>>>>>>>> Tena Sakai >>>>>>>>>>>>> tsa...@gallo.ucsf.edu >>>>>>>>>>>>> >>>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi >>>>>>>>>>>>> >>>>>>>>>>>>> I have an app.ac1 file like below: >>>>>>>>>>>>> [tsakai@vixen local]$ cat app.ac1 >>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript >>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5 >>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript >>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6 >>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript >>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7 >>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript >>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8 >>>>>>>>>>>>> >>>>>>>>>>>>> The program I run is >>>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x >>>>>>>>>>>>> where x is [5..8]. The machines vixen and blitzen each run 2 >>>>>>>>>>>>> runs. >>>>>>>>>>>>> >>>>>>>>>>>>> Here’s the program fib.R: >>>>>>>>>>>>> [tsakai@vixen local]$ cat fib.R >>>>>>>>>>>>> # fib() computes, given index n, a Fibonacci number iteratively >>>>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11) >>>>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 >>>>>>>>>>>>> >>>>>>>>>>>>> fib <- function( n ) { >>>>>>>>>>>>> a <- 0 >>>>>>>>>>>>> b <- 1 >>>>>>>>>>>>> for ( i in 1:n ) { >>>>>>>>>>>>> t <- b >>>>>>>>>>>>> b <- a >>>>>>>>>>>>> a <- a + t >>>>>>>>>>>>> } >>>>>>>>>>>>> a >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> arg <- commandArgs( TRUE ) >>>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE ) >>>>>>>>>>>>> cat( fib(arg), myHost, '\n' ) >>>>>>>>>>>>> >>>>>>>>>>>>> It reads an argument from the command line and produces the Fibonacci >>>>>>>>>>>>> number that >>>>>>>>>>>>> corresponds to that index, followed by the machine name. Pretty >>>>>>>>>>>>> simple stuff.
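For reference, the iterative scheme in fib.R can be sketched in portable shell to confirm the outputs the appfile should produce (arguments 5 through 8 giving 5, 8, 13, 21):

```shell
# Same iteration as fib.R: repeatedly shift (a, b) -> (a + b, a),
# n times, starting from a=0, b=1, then print a.
fib() {
  a=0; b=1; i=0
  while [ "$i" -lt "$1" ]; do
    t=$b; b=$a; a=$((a + t)); i=$((i + 1))
  done
  echo "$a"
}

# The four appfile arguments from the post above.
for n in 5 6 7 8; do fib "$n"; done
```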
>>>>>>>>>>>>> >>>>>>>>>>>>> Here’s the run output: >>>>>>>>>>>>> [tsakai@vixen local]$ mpirun -app app.ac1 >>>>>>>>>>>>> 5 vixen.egcrc.org >>>>>>>>>>>>> 8 vixen.egcrc.org >>>>>>>>>>>>> 13 blitzen.egcrc.org >>>>>>>>>>>>> 21 blitzen.egcrc.org >>>>>>>>>>>>> >>>>>>>>>>>>> Which is exactly what I expect. So far so good. >>>>>>>>>>>>> >>>>>>>>>>>>> Now I want to run the same thing on the cloud. I launch 2 instances >>>>>>>>>>>>> of the same >>>>>>>>>>>>> virtual machine, which I get to by: >>>>>>>>>>>>> [tsakai@vixen local]$ ssh -A -i ~/.ssh/tsakai >>>>>>>>>>>>> machine-instance-A-public-dns >>>>>>>>>>>>> >>>>>>>>>>>>> Now I am on machine A: >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B >>>>>>>>>>>>> without password authentication, >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai >>>>>>>>>>>>> domU-12-31-39-0C-C8-01 >>>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4 >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname >>>>>>>>>>>>> domU-12-31-39-0C-C8-01 >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine >>>>>>>>>>>>> A without using a password >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai >>>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' >>>>>>>>>>>>> can't be established.
>>>>>>>>>>>>> RSA key fingerprint is >>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes >>>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the >>>>>>>>>>>>> list >>>>>>>>>>>>> of >>>>>>>>>>>>> known hosts. >>>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239 >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ exit >>>>>>>>>>>>> logout >>>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed. >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ exit >>>>>>>>>>>>> logout >>>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed. >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>>> >>>>>>>>>>>>> As you can see, neither machine uses password for authentication; >>>>>>>>>>>>> it >>>>>>>>>>>>> uses >>>>>>>>>>>>> public/private key pairs. There is no problem (that I can see) >>>>>>>>>>>>> for >>>>>>>>>>>>> ssh >>>>>>>>>>>>> invocation >>>>>>>>>>>>> from one machine to the other. This is so because I have a copy >>>>>>>>>>>>> of >>>>>>>>>>>>> public >>>>>>>>>>>>> key >>>>>>>>>>>>> and a copy of private key on each instance. 
>>>>>>>>>>>>> >>>>>>>>>>>>> The app.ac file is identical, except for the node names: >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1 >>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5 >>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6 >>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7 >>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8 >>>>>>>>>>>>> >>>>>>>>>>>>> Here’s what happens with mpirun: >>>>>>>>>>>>> >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1 >>>>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: >>>>>>>>>>>>> Permission denied, please try again. >>>>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job... >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>>>>> process >>>>>>>>>>>>> that caused that situation. >>>>>>>>>>>>> >>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>> >>>>>>>>>>>>> mpirun: clean termination accomplished >>>>>>>>>>>>> >>>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>>> >>>>>>>>>>>>> Mpirun (or somebody else?) asks me for a password, which I don’t have. >>>>>>>>>>>>> I end up typing control-C. >>>>>>>>>>>>> >>>>>>>>>>>>> Here’s my question: >>>>>>>>>>>>> How can I get past authentication by mpirun where there is no >>>>>>>>>>>>> password? >>>>>>>>>>>>> >>>>>>>>>>>>> I would appreciate your help/insight greatly. >>>>>>>>>>>>> >>>>>>>>>>>>> Thank you.
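Since mpirun launches an orted on every -H host over ssh, one way to isolate a password prompt like the one above is to probe each appfile host non-interactively before invoking mpirun. A sketch under stated assumptions: the appfile contents are recreated from the post above into a temporary file, and the BatchMode probe in the comment is illustrative (BatchMode=yes makes ssh fail instead of prompting):

```shell
# Recreate the appfile from the post above in a temporary file,
# then extract the unique -H host names.
appfile=$(mktemp)
cat > "$appfile" <<'EOF'
-H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
-H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
-H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
-H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
EOF

awk '$1 == "-H" {print $2}' "$appfile" | sort -u

# Each host printed above could then be probed with, e.g.:
#   ssh -o BatchMode=yes -i ~/.ssh/tsakai <host> true && echo "<host> ok"
# Any host that fails here is the one mpirun's ssh would prompt for.
```

If every probe succeeds without a prompt but mpirun still asks for a password, the launch is likely not using the expected key, which is what a per-host (or Host *) IdentityFile entry in ~/.ssh/config addresses.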
>>>>>>>>>>>>> >>>>>>>>>>>>> Tena Sakai >>>>>>>>>>>>> tsa...@gallo.ucsf.edu >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users