Hi Ralph, Yes, I have. In the Open MPI email archive I found a few threads related to EC2, but nothing relevant to what I am experiencing. On the EC2 discussion list I found a few mentions of Open MPI; in one case I might be able to help, but nothing relevant to my situation.
Regards, Tena On 2/12/11 6:06 AM, "Ralph Castain" <r...@open-mpi.org> wrote: > Have you searched the email archive and/or web for openmpi and Amazon cloud? > Others have previously worked through many of these problems for that > environment - might be worth a look to see if someone already solved this, or > at least a contact point for someone who is already running in that > environment. > > IIRC, there are some unique problems with running on that platform. > > > On Feb 12, 2011, at 12:38 AM, Tena Sakai wrote: > >> Hi Gus, >> >> Thank you for all your suggestions. >> >> I fixed the limits as you suggested and ran the test and >> I am still getting the same failure. More on that in a >> bit. But here is a bit of my response to what you mentioned. >> >>> the IP number you checked now is not the same as in your >>> message with the MPI failure/errors. >>> Not sure if I understand which computers we're talking about, >>> or where these computers are (at Amazon?), >>> or if they change depending on each session you use to run your programs, >>> if they are identical machines with the same limits or if they differ. >> >> Everything I mentioned in the last 2-3 days is on the Amazon EC2 cloud. I >> have no problem running the same thing locally (vixen is my local >> machine): >> >> [tsakai@vixen Rmpi]$ cat app.ac1 >> -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5 >> -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6 >> -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 7 >> -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 8 >> [tsakai@vixen Rmpi]$ >> [tsakai@vixen Rmpi]$ mpirun --app app.ac1 >> 5 vixen.egcrc.org >> 8 vixen.egcrc.org >> 13 blitzen.egcrc.org >> 21 blitzen.egcrc.org >> [tsakai@vixen Rmpi]$ # these lines are the correct result. >> [tsakai@vixen Rmpi]$ >> >> Amazon EC2, where the strange behavior happens, is a virtualized >> environment. They charge by the hour.
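For readers unfamiliar with the app.ac1 file shown above: each line of an Open MPI appfile is an independent set of mpirun arguments (`-H` host, `-np` process count, then the command). A minimal self-contained analogue, with `hostname` standing in for the `Rscript fib.R` calls and `localhost` for the real hosts:

```shell
# Build a two-entry appfile in the same format as app.ac1 above;
# `hostname` is a stand-in for the real per-rank command.
cat > app.demo <<'EOF'
-H localhost -np 1 hostname
-H localhost -np 1 hostname
EOF
cat app.demo
# With Open MPI installed it would be launched as in the thread:
#   mpirun --app app.demo
```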
I launch an instance of a machine >> when I need it and I shut it down when I am done. Each time I get >> different IP addresses (2 per instance, one on the internal network and >> the other for the public interface). That is why I don't show a consistent >> IP address or DNS name. Every time I shut down the machine, what I did on >> that instance disappears and on the next instance I have to recreate it >> from scratch --a case in point is ~/.ssh/config--, which is what >> I have been doing (unless I take a 'snapshot' of the image and save it >> to persistent storage (and doing a snapshot is a bit of work)). >> >>> One of the error messages mentions LD_LIBRARY_PATH. >>> Is it set to point to the OpenMPI lib directory? >>> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly >>> set. >> >> Yes, I have been setting LD_LIBRARY_PATH manually every time, because >> I have neglected to put it into my bash startup file as part of the AMI >> (Amazon Machine Image) build. >> >> Now what I have done is get onto an instance as tsakai, save the output >> from 'ulimit -a', set the /etc/security/limits.conf parameters as you >> suggest, get off and re-log onto the instance (thereby activating >> those ulimit parameters), and run the same (actually simpler) test, >> as tsakai and as root. >> >> [tsakai@vixen Rmpi]$ >> [tsakai@vixen Rmpi]$ # 2ec2 below is a script/wrapper around ssh to >> [tsakai@vixen Rmpi]$ # make ssh invocation line shorter. >> [tsakai@vixen Rmpi]$ >> [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com >> The authenticity of host 'ec2-50-16-55-64.compute-1.amazonaws.com >> (50.16.55.64)' can't be established. >> RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >> Are you sure you want to continue connecting (yes/no)?
yes >> Last login: Tue Feb 8 22:52:54 2011 from 10.201.197.188 >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ ulimit -a > mylimit.1 >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ sudo su >> bash-3.2# >> bash-3.2# cat - >> /etc/security/limits.conf >> * - memlock -1 >> * - stack -1 >> * - nofile 4096 >> bash-3.2# >> bash-3.2# tail /etc/security/limits.conf >> #@student hard nproc 20 >> #@faculty soft nproc 20 >> #@faculty hard nproc 50 >> #ftp hard nproc 0 >> #@student - maxlogins 4 >> >> # End of file >> * - memlock -1 >> * - stack -1 >> * - nofile 4096 >> bash-3.2# >> bash-3.2# exit >> exit >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ # logout and log back in to activate the >> [tsakai@ip-10-114-138-129 ~]$ # new setting. >> [tsakai@ip-10-114-138-129 ~]$ exit >> logout >> [tsakai@vixen ec2]$ >> [tsakai@vixen ec2]$ # I am back on vixen and about to relogging back onto >> [tsakai@vixen ec2]$ # the instance which is still running. >> [tsakai@vixen ec2]$ >> [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com >> Last login: Fri Feb 11 23:50:47 2011 from 63.193.205.1 >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ ulimit -a > mylimit.2 >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ diff mylimit.1 mylimit.2 >> 6c6 >> < max locked memory (kbytes, -l) 32 >> --- >>> max locked memory (kbytes, -l) unlimited >> 8c8 >> < open files (-n) 1024 >> --- >>> open files (-n) 4096 >> 12c12 >> < stack size (kbytes, -s) 8192 >> --- >>> stack size (kbytes, -s) unlimited >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ # yes, I have the same ulimit parameters as >> [tsakai@ip-10-114-138-129 ~]$ # Gus suggested. 
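Since setting LD_LIBRARY_PATH by hand on every fresh instance is easy to forget, one option (a sketch, not something done in the thread; the /usr/local prefix is the one used throughout) is to bake the settings into the image's bash startup file before snapshotting the AMI:

```shell
# Append PATH and LD_LIBRARY_PATH to ~/.bashrc so every new login on an
# instance built from the resulting AMI gets them automatically.
cat >> "$HOME/.bashrc" <<'EOF'
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
EOF
tail -n 2 "$HOME/.bashrc"
```

Note that whether a non-interactive ssh command reads ~/.bashrc varies by distribution, so this helps interactive logins but is not guaranteed to reach remotely spawned daemons.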
>> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ export LD_LIBRARY_PATH=/usr/local/lib >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ env | grep LD_LIB >> LD_LIBRARY_PATH=/usr/local/lib >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ cat - > app.ac >> -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 >> -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ cat app.ac >> -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 >> -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ hostname >> ip-10-114-138-129 >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ # this run doesn't involve other node. >> [tsakai@ip-10-114-138-129 ~]$ # just use this machine's cores. >> [tsakai@ip-10-114-138-129 ~]$ # there are 2 cores on this machine. >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ mpirun --app app.ac >> -------------------------------------------------------------------------- >> mpirun was unable to launch the specified application as it encountered an >> error: >> >> Error: pipe function call failed when setting up I/O forwarding subsystem >> Node: ip-10-114-138-129 >> >> while attempting to start process rank 0. >> -------------------------------------------------------------------------- >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ # I still get the same error! >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ cat /proc/sys/fs/file-nr >> 512 0 762674 >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ # number of open files (512) is no where >> [tsakai@ip-10-114-138-129 ~]$ # close to the limit, which is 4096 now. >> [tsakai@ip-10-114-138-129 ~]$ # now let's run it as root. 
>> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ sudo su >> bash-3.2# >> bash-3.2# env | grep LD_LIBR >> LD_LIBRARY_PATH=/usr/local/lib >> bash-3.2# >> bash-3.2# pwd >> /home/tsakai >> bash-3.2# >> bash-3.2# mpirun --app ./app.ac >> 5 ip-10-114-138-129 >> 8 ip-10-114-138-129 >> bash-3.2# >> bash-3.2# # that's the correct result! >> bash-3.2# >> bash-3.2# cat /proc/sys/fs/file-nr >> 512 0 762674 >> bash-3.2# >> bash-3.2# # this shows that mpirun didn't leave any >> bash-3.2# # open file behind, I think. That's good. >> bash-3.2# >> bash-3.2# exit >> exit >> [tsakai@ip-10-114-138-129 ~]$ >> [tsakai@ip-10-114-138-129 ~]$ exit >> logout >> [tsakai@vixen ec2]$ >> >> Had it failed both as root and as user >> tsakai, I could conclude that either the virtualized environment >> is disagreeable with Open MPI OR there is something wrong with >> what I am trying to do. But what kills me is that it *does* >> work when run by root. Why the pipe system call fails for user >> tsakai and not for root is something I don't understand. >> >> BTW, here is the same test (using a single machine) in my local >> environment (i.e., no virtualized environment): >> >> [tsakai@vixen Rmpi]$ cat app.ac2 >> -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5 >> -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6 >> [tsakai@vixen Rmpi]$ >> [tsakai@vixen Rmpi]$ mpirun --app app.ac2 >> 5 vixen.egcrc.org >> 8 vixen.egcrc.org >> [tsakai@vixen Rmpi]$ >> >> I am running out of stones to turn over for now and maybe it's >> a good time to go to bed. :) >> >> I would appreciate it if you come up with different things >> to try. >> >> Many thanks for your help. >> >> Regards, >> >> Tena >> >> >> On 2/11/11 7:45 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote: >> >>> Hi Tena >>> >>> We set up the cluster nodes to run MPI programs >>> with stacksize unlimited, >>> memlock unlimited, >>> 4096 max open files, >>> to avoid crashing on edge cases.
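The three settings just mentioned (stack, memlock, open files) can be checked in one shot on each node; a small sketch using bash's ulimit:

```shell
# Report the three limits relevant here for the current shell:
# max locked memory, stack size, and number of open files.
printf 'memlock: %s\nstack:   %s\nnofile:  %s\n' \
  "$(ulimit -l)" "$(ulimit -s)" "$(ulimit -n)"
```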
>>> This is kind of typical for HPC, MPI, number crunching. >>> >>> However, some are quite big codes, >>> and from what you said yours is not (or not yet). >>> >>> Your stack limit sounds quite small, but when >>> we had problems with stack the result was a segmentation fault. >>> 1024 files I guess is a default for 32-bit Linux distributions, >>> but some programs break there. >>> >>> If you want to do this, put these lines at the bottom >>> of /etc/security/limits.conf: >>> >>> # End of file >>> * - memlock -1 >>> * - stack -1 >>> * - nofile 4096 >>> >>> I don't think you should give an unlimited number of processes to >>> regular users; keep that privilege for root (which is where >>> the two have different limits). >>> >>> You may want to monitor /proc/sys/fs/file-nr while the program runs. >>> The first number is the actual number of open files. >>> Top or vmstat also help see how you are doing in terms of memory, >>> although you suggested these are (small?) test programs, unlikely to run >>> out of memory. >>> >>> If you are using two nodes, check the same stuff on the other node too. >>> Also, the IP number you checked now is not the same as in your >>> message with the MPI failure/errors. >>> Not sure if I understand which computers we're talking about, >>> or where these computers are (at Amazon?), >>> or if they change depending on each session you use to run your programs, >>> if they are identical machines with the same limits or if they differ. >>> >>> One of the error messages mentions LD_LIBRARY_PATH. >>> Is it set to point to the OpenMPI lib directory? >>> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly set. >>> >>> I hope this helps, although I am afraid I may be missing the point. >>> >>> Gus Correa >>> >>> Tena Sakai wrote: >>>> Hi Gus, >>>> >>>> Thank you for your tips. >>>> >>>> I didn't find any smoking gun or anything that comes close.
>>>> Here's the upshot: >>>> >>>> [tsakai@ip-10-114-239-188 ~]$ ulimit -a >>>> core file size (blocks, -c) 0 >>>> data seg size (kbytes, -d) unlimited >>>> scheduling priority (-e) 0 >>>> file size (blocks, -f) unlimited >>>> pending signals (-i) 61504 >>>> max locked memory (kbytes, -l) 32 >>>> max memory size (kbytes, -m) unlimited >>>> open files (-n) 1024 >>>> pipe size (512 bytes, -p) 8 >>>> POSIX message queues (bytes, -q) 819200 >>>> real-time priority (-r) 0 >>>> stack size (kbytes, -s) 8192 >>>> cpu time (seconds, -t) unlimited >>>> max user processes (-u) 61504 >>>> virtual memory (kbytes, -v) unlimited >>>> file locks (-x) unlimited >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ sudo su >>>> bash-3.2# >>>> bash-3.2# ulimit -a >>>> core file size (blocks, -c) 0 >>>> data seg size (kbytes, -d) unlimited >>>> scheduling priority (-e) 0 >>>> file size (blocks, -f) unlimited >>>> pending signals (-i) 61504 >>>> max locked memory (kbytes, -l) 32 >>>> max memory size (kbytes, -m) unlimited >>>> open files (-n) 1024 >>>> pipe size (512 bytes, -p) 8 >>>> POSIX message queues (bytes, -q) 819200 >>>> real-time priority (-r) 0 >>>> stack size (kbytes, -s) 8192 >>>> cpu time (seconds, -t) unlimited >>>> max user processes (-u) unlimited >>>> virtual memory (kbytes, -v) unlimited >>>> file locks (-x) unlimited >>>> bash-3.2# >>>> bash-3.2# >>>> bash-3.2# ulimit -a > root_ulimit-a >>>> bash-3.2# exit >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a >>>> 14c14 >>>> < max user processes (-u) unlimited >>>> --- >>>>> max user processes (-u) 61504 >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr >>>> /proc/sys/fs/file-max >>>> 480 0 762674 >>>> 762674 >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ sudo su >>>> bash-3.2# >>>> 
bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max >>>> 512 0 762674 >>>> 762674 >>>> bash-3.2# exit >>>> exit >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max >>>> -bash: sysctl: command not found >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ /sbin/!! >>>> /sbin/sysctl -a |grep fs.file-max >>>> error: permission denied on key 'kernel.cad_pid' >>>> error: permission denied on key 'kernel.cap-bound' >>>> fs.file-max = 762674 >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max >>>> fs.file-max = 762674 >>>> [tsakai@ip-10-114-239-188 ~]$ >>>> >>>> I see a bit of difference between root and tsakai, but I cannot >>>> believe such a small difference results in the somewhat catastrophic >>>> failure I have reported. Would you agree with me? >>>> >>>> Regards, >>>> >>>> Tena >>>> >>>> On 2/11/11 6:06 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote: >>>> >>>>> Hi Tena >>>>> >>>>> Please read one answer inline. >>>>> >>>>> Tena Sakai wrote: >>>>>> Hi Jeff, >>>>>> Hi Gus, >>>>>> >>>>>> Thanks for your replies. >>>>>> >>>>>> I have pretty much ruled out PATH issues by setting tsakai's PATH >>>>>> as identical to that of root. In that setting I reproduced the >>>>>> same result as before: root can run mpirun correctly and tsakai >>>>>> cannot. >>>>>> >>>>>> I have also checked the permissions on the /tmp directory. tsakai has >>>>>> no problem creating files under /tmp. >>>>>> >>>>>> I am trying to come up with a strategy to show that each and every >>>>>> program in the PATH has "world" executable permission. It is a >>>>>> stone to turn over, but I am not holding my breath. >>>>>> >>>>>>> ... you are running out of file descriptors. Are file descriptors >>>>>>> limited on a per-process basis, perchance? >>>>>> I have never heard there is such a restriction on Amazon EC2.
There >>>>>> are folks who keep running instances for a long, long time. Whereas >>>>>> in my case, I launch 2 instances, check things out, and then turn >>>>>> the instances off. (Given that the state of California has huge >>>>>> debts, our funding is very tight.) So, I really doubt that's the >>>>>> case. I have run mpirun unsuccessfully as user tsakai and immediately >>>>>> after successfully as root. Still, I would be happy if you can tell >>>>>> me a way to tell the number of file descriptors used or remaining. >>>>>> >>>>>> Your mention of file descriptors made me think of something under >>>>>> /dev. But I don't know exactly what I am fishing for. Do you have >>>>>> some suggestions? >>>>>> >>>>> 1) If the environment has anything to do with Linux, >>>>> check: >>>>> >>>>> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max >>>>> >>>>> >>>>> or >>>>> >>>>> sysctl -a |grep fs.file-max >>>>> >>>>> This max can be set (fs.file-max=whatever_is_reasonable) >>>>> in /etc/sysctl.conf >>>>> >>>>> See 'man sysctl' and 'man sysctl.conf' >>>>> >>>>> 2) Another possible source of limits. >>>>> >>>>> Check "ulimit -a" (bash) or "limit" (tcsh). >>>>> >>>>> If you need to change them, look at: >>>>> >>>>> /etc/security/limits.conf >>>>> >>>>> (See also 'man limits.conf') >>>>> >>>>> ** >>>>> >>>>> Since "root can but Tena cannot", >>>>> I would check 2) first, >>>>> as they are the 'per user/per group' limits, >>>>> whereas 1) is kernel/system-wide. >>>>> >>>>> I hope this helps, >>>>> Gus Correa >>>>> >>>>> PS - I know you are a wise and careful programmer, >>>>> but here we had cases of programs that would >>>>> fail because of too many files that were open and never closed, >>>>> eventually exceeding the max available/permissible. >>>>> So, it does happen. >>>>> >>>>>> I wish I could reproduce this (weird) behavior on a different >>>>>> set of machines. I certainly cannot in my local environment. Sigh!
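As an aside (a sketch, not something tried in the thread): the reported symptom can be provoked deliberately, because creating a pipe needs two free descriptors, so an artificially low per-process open-files limit makes pipe(2) fail with EMFILE:

```shell
# Drop the soft open-files limit so low that creating a pipe (two new
# descriptors on top of stdin/stdout/stderr) must fail, then try a
# trivial pipeline; bash reports a pipe error instead of printing "hi".
bash -c 'ulimit -n 4; echo hi | cat' 2>&1 || true
# Under a sane limit the same pipeline works:
bash -c 'ulimit -n 256; echo hi | cat'
```

This is only a demonstration of the failure mode, not a claim that this is the cause here, since the measured limit (4096) looked ample.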
>>>>>> >>>>>> Regards, >>>>>> >>>>>> Tena >>>>>> >>>>>> >>>>>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote: >>>>>> >>>>>>> It is concerning if the pipe system call fails - I can't think of why >>>>>>> that >>>>>>> would happen. That's not usually a permissions issue but rather a deeper >>>>>>> indication that something is either seriously wrong on your system or >>>>>>> you >>>>>>> are >>>>>>> running out of file descriptors. Are file descriptors limited on a >>>>>>> per-process >>>>>>> basis, perchance? >>>>>>> >>>>>>> Sent from my PDA. No type good. >>>>>>> >>>>>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Tena >>>>>>>> >>>>>>>> Since root can but you can't, >>>>>>>> is it a directory permission problem perhaps? >>>>>>>> Check the execution directory permission (on both machines, >>>>>>>> if this is not an NFS-mounted dir). >>>>>>>> I am not sure, but IIRR OpenMPI also uses /tmp for >>>>>>>> under-the-hood stuff, worth checking permissions there also. >>>>>>>> Just a naive guess. >>>>>>>> >>>>>>>> Congrats for all the progress with the cloudy MPI! >>>>>>>> >>>>>>>> Gus Correa >>>>>>>> >>>>>>>> Tena Sakai wrote: >>>>>>>>> Hi, >>>>>>>>> I have made a bit more progress. I think I can say the ssh >>>>>>>>> authentication problem is behind me now. I am still having a problem running >>>>>>>>> mpirun, but the latest discovery, which I can reproduce, is that >>>>>>>>> I can run mpirun as root.
Here's the session log: >>>>>>>>> [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com >>>>>>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195 >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ll >>>>>>>>> total 8 >>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ll .ssh >>>>>>>>> total 16 >>>>>>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys >>>>>>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config >>>>>>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts >>>>>>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal >>>>>>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31 >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ # I am on machine B >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ hostname >>>>>>>>> ip-10-100-243-195 >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ ll >>>>>>>>> total 8 >>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac >>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ cat app.ac >>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 >>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 >>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7 >>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8 >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ # go back to machine A >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ exit >>>>>>>>> logout >>>>>>>>> Connection 
to ip-10-100-243-195.ec2.internal closed. >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ hostname >>>>>>>>> ip-10-195-198-31 >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac >>>>>>>>> >>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>> mpirun was unable to launch the specified application as it >>>>>>>>> encountered >>>>>>>>> an >>>>>>>>> error: >>>>>>>>> Error: pipe function call failed when setting up I/O forwarding >>>>>>>>> subsystem >>>>>>>>> Node: ip-10-195-198-31 >>>>>>>>> while attempting to start process rank 0. >>>>>>>>> >>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # try it as root >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ sudo su >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# pwd >>>>>>>>> /home/tsakai >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# ls -l /root/.ssh/config >>>>>>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# cat /root/.ssh/config >>>>>>>>> Host * >>>>>>>>> IdentityFile /root/.ssh/.derobee/.kagi >>>>>>>>> IdentitiesOnly yes >>>>>>>>> BatchMode yes >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# pwd >>>>>>>>> /home/tsakai >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# ls -l >>>>>>>>> total 8 >>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# # now is the time for mpirun >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# mpirun --app ./app.ac >>>>>>>>> 13 ip-10-100-243-195 >>>>>>>>> 21 ip-10-100-243-195 >>>>>>>>> 5 ip-10-195-198-31 >>>>>>>>> 8 ip-10-195-198-31 >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# # It works
(being root)! >>>>>>>>> bash-3.2# >>>>>>>>> bash-3.2# exit >>>>>>>>> exit >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac >>>>>>>>> >>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>> mpirun was unable to launch the specified application as it >>>>>>>>> encountered >>>>>>>>> an >>>>>>>>> error: >>>>>>>>> Error: pipe function call failed when setting up I/O forwarding >>>>>>>>> subsystem >>>>>>>>> Node: ip-10-195-198-31 >>>>>>>>> while attempting to start process rank 0. >>>>>>>>> >>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # I don't get it. >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ exit >>>>>>>>> logout >>>>>>>>> [tsakai@vixen ec2]$ >>>>>>>>> So, why does it say "pipe function call failed when setting up >>>>>>>>> I/O forwarding subsystem Node: ip-10-195-198-31" ? >>>>>>>>> The node it is referring to is not the remote machine. It is >>>>>>>>> what I call machine A. I first thought maybe this is a problem >>>>>>>>> with the PATH variable. But I don't think so. I compared root's >>>>>>>>> PATH to tsakai's and made them identical and retried. >>>>>>>>> I got the same behavior. >>>>>>>>> If you could enlighten me why this is happening, I would really >>>>>>>>> appreciate it. >>>>>>>>> Thank you. >>>>>>>>> Tena >>>>>>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>>>>>>>>> Hi Jeff, >>>>>>>>>> >>>>>>>>>> Thanks for the firewall tip. I tried it while allowing all tcp >>>>>>>>>> traffic >>>>>>>>>> and got an interesting and perplexing result.
Here's what's interesting >>>>>>>>>> (BTW, I got rid of "LogLevel DEBUG3" from ~/.ssh/config on this run): >>>>>>>>>> >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>>>>>>>>> Host key verification failed. >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> A daemon (pid 2743) died unexpectedly with status 255 while >>>>>>>>>> attempting >>>>>>>>>> to launch so we are aborting. >>>>>>>>>> >>>>>>>>>> There may be more information reported by the environment (see >>>>>>>>>> above). >>>>>>>>>> >>>>>>>>>> This may be because the daemon was unable to find all the needed >>>>>>>>>> shared >>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to >>>>>>>>>> have >>>>>>>>>> the >>>>>>>>>> location of the shared libraries on the remote nodes and this will >>>>>>>>>> automatically be forwarded to the remote nodes. >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>> process >>>>>>>>>> that caused that situation. >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> mpirun: clean termination accomplished >>>>>>>>>> >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to >>>>>>>>>> /usr/local/lib >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib' >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # I'd better do this on machine B as well >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159 >>>>>>>>>> Warning: Identity file tsakai not accessible: No such file or >>>>>>>>>> directory. >>>>>>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132 >>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ export >>>>>>>>>> LD_LIBRARY_PATH='/usr/local/lib' >>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB >>>>>>>>>> LD_LIBRARY_PATH=/usr/local/lib >>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ # OK, now go back to machine A >>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ exit >>>>>>>>>> logout >>>>>>>>>> Connection to ip-10-195-171-159 closed. >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ hostname >>>>>>>>>> ip-10-203-21-132 >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # try mpirun again >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>>>>>>>>> Host key verification failed. >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> A daemon (pid 2789) died unexpectedly with status 255 while >>>>>>>>>> attempting >>>>>>>>>> to launch so we are aborting. >>>>>>>>>> >>>>>>>>>> There may be more information reported by the environment (see >>>>>>>>>> above). >>>>>>>>>> >>>>>>>>>> This may be because the daemon was unable to find all the needed >>>>>>>>>> shared >>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to >>>>>>>>>> have >>>>>>>>>> the >>>>>>>>>> location of the shared libraries on the remote nodes and this will >>>>>>>>>> automatically be forwarded to the remote nodes. >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>> process >>>>>>>>>> that caused that situation. >>>>>>>>>> >>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>> mpirun: clean termination accomplished >>>>>>>>>> >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in >>>>>>>>>> /usr/local/lib...
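The libraries were indeed installed there; a common gotcha, offered as a guess rather than a diagnosis: a variable exported in the interactive shell is not automatically present in the bare environment a remotely spawned daemon starts with. A local simulation of that difference:

```shell
# Exporting in this shell does not populate a fresh, empty environment,
# which is roughly what a remote orted may see.
export LD_LIBRARY_PATH=/usr/local/lib
env -i bash -c 'echo "bare environment sees: ${LD_LIBRARY_PATH:-<unset>}"'
# Open MPI can forward the variable to remote nodes explicitly:
#   mpirun -x LD_LIBRARY_PATH --app app.ac2
```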
>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less >>>>>>>>>> total 16604 >>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> >>>>>>>>>> libfuse.so.2.8.5 >>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> >>>>>>>>>> libfuse.so.2.8.5 >>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> >>>>>>>>>> libmca_common_sm.so.1.0.0 >>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> >>>>>>>>>> libmca_common_sm.so.1.0.0 >>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> >>>>>>>>>> libmpi.so.0.0.2 >>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> >>>>>>>>>> libmpi.so.0.0.2 >>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> >>>>>>>>>> libmpi_cxx.so.0.0.1 >>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> >>>>>>>>>> libmpi_cxx.so.0.0.1 >>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> >>>>>>>>>> libmpi_f77.so.0.0.1 >>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> >>>>>>>>>> libmpi_f77.so.0.0.1 >>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> >>>>>>>>>> libmpi_f90.so.0.0.1 >>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> >>>>>>>>>> libmpi_f90.so.0.0.1 >>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> >>>>>>>>>> libopen-pal.so.0.0.0 >>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> >>>>>>>>>> libopen-pal.so.0.0.0 >>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> >>>>>>>>>> libopen-rte.so.0.0.0 >>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> >>>>>>>>>> libopen-rte.so.0.0.0 >>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> >>>>>>>>>> libopenmpi_malloc.so.0.0.0 >>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 >>>>>>>>>> -> >>>>>>>>>> libopenmpi_malloc.so.0.0.0 >>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 
libulockmgr.so -> >>>>>>>>>> libulockmgr.so.1.0.1 >>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> >>>>>>>>>> libulockmgr.so.1.0.1 >>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> >>>>>>>>>> libxml2.so.2.7.2 >>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> >>>>>>>>>> libxml2.so.2.7.2 >>>>>>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused... >>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>> >>>>>>>>>> Do you know why it's complaining about shared libraries? >>>>>>>>>> >>>>>>>>>> Thank you. >>>>>>>>>> >>>>>>>>>> Tena >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote: >>>>>>>>>> >>>>>>>>>>> Your prior mails were about ssh issues, but this one sounds like you >>>>>>>>>>> might >>>>>>>>>>> have firewall issues. >>>>>>>>>>> >>>>>>>>>>> That is, the "orted" command attempts to open a TCP socket back to >>>>>>>>>>> mpirun >>>>>>>>>>> for >>>>>>>>>>> various command and control reasons. If it is blocked from doing so >>>>>>>>>>> by >>>>>>>>>>> a >>>>>>>>>>> firewall, Open MPI won't run. In general, you can either disable >>>>>>>>>>> your >>>>>>>>>>> firewall or you can set up a trust relationship for TCP connections >>>>>>>>>>> within >>>>>>>>>>> your >>>>>>>>>>> cluster. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Reuti, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so, and the complete >>>>>>>>>>>> session is captured in the attached file. >>>>>>>>>>>> >>>>>>>>>>>> What I did is very similar to what I have done before: verify >>>>>>>>>>>> that ssh works and then run the mpirun command. In my somewhat lengthy >>>>>>>>>>>> session log, there are two responses from "LogLevel DEBUG3." First >>>>>>>>>>>> from an scp invocation and then from the mpirun invocation.
>>>>>>>>>>>> They both say
>>>>>>>>>>>>     debug1: Authentication succeeded (publickey).
>>>>>>>>>>>>
>>>>>>>>>>>> From the mpirun invocation, I see a line:
>>>>>>>>>>>>     debug1: Sending command: orted --daemonize -mca ess env -mca orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>>>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>>>>>>>> After that it hung, and I control-C'ed out of it, which gave me
>>>>>>>>>>>> more lines. The lines after
>>>>>>>>>>>>     debug1: Sending command: orted bla bla bla
>>>>>>>>>>>> don't look good to me, but in truth I have no idea what they mean.
>>>>>>>>>>>>
>>>>>>>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Tena
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 10.02.2011 at 19:11, Tena Sakai wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> your local machine is Linux-like, but the execution hosts
>>>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>
>>>>>>>>>>>> No, my environment is entirely Linux. The path to my home
>>>>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>>>>>>> even though it is an NFS mount from vixen (which is known to
>>>>>>>>>>>> itself as /home/tsakai). For historical reasons, I chose to
>>>>>>>>>>>> make a symbolic link named /Users to vixen's /home, so that
>>>>>>>>>>>> I can use a consistent path on both vixen and blitzen.
>>>>>>>>>>>>
>>>>>>>>>>>> Okay. Sometimes the protection of the home directory must be
>>>>>>>>>>>> adjusted too, but as you can do it from the command line this
>>>>>>>>>>>> shouldn't be an issue.
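A quick way to act on the debug output above: the `--hnp-uri` argument embeds the IP address and TCP port that mpirun is listening on, which is exactly the connection the remote orted must be able to open back. The sketch below pulls them out for a reachability test; the URI shape `jobid.0;tcp://ip:port` is assumed from the log line, and `hnp_endpoint` is a made-up helper name, not an Open MPI tool.

```shell
# Sketch: extract the IP and port mpirun listens on from the --hnp-uri
# value seen in the DEBUG3 output, so reachability can then be probed
# from the remote node (e.g. with "nc -z <ip> <port>" or telnet).
hnp_endpoint() {
    # expects a string shaped like: 3344891904.0;tcp://10.194.95.239:54256
    echo "$1" | sed -n 's|.*tcp://\([0-9.]*\):\([0-9]*\).*|\1 \2|p'
}

hnp_endpoint '3344891904.0;tcp://10.194.95.239:54256'
# prints: 10.194.95.239 54256
```

If that port cannot be reached from machine A, the hang after "Sending command: orted ..." is consistent with Jeff's firewall explanation.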
>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>>
>>>>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I
>>>>>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>>>>>> but to address one Amazon cluster node from the other I use
>>>>>>>>>>>> the nodes' private DNS names: domU-12-31-39-07-35-21 and
>>>>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private DNS names
>>>>>>>>>>>> change from one launch to another. I am using passphraseless
>>>>>>>>>>>> ssh keys for authentication in all cases, i.e., from vixen to
>>>>>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>>>>>>> Amazon node B back to A. (Please see my initial post; there
>>>>>>>>>>>> is a session dialogue for this.) They all work without an
>>>>>>>>>>>> authentication dialogue, except for a brief initial exchange:
>>>>>>>>>>>>     The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)' can't be established.
>>>>>>>>>>>>     RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>     Are you sure you want to continue connecting (yes/no)?
>>>>>>>>>>>> to which I say "yes."
>>>>>>>>>>>>
>>>>>>>>>>>> But I am unclear on what you mean by "hostbased authentication".
>>>>>>>>>>>> Doesn't that mean with a password? If so, it is not an option.
>>>>>>>>>>>>
>>>>>>>>>>>> No. It's convenient inside a private cluster, as it won't fill
>>>>>>>>>>>> each user's known_hosts file and you don't need to create any
>>>>>>>>>>>> ssh keys. But when the hostname changes every time, it might
>>>>>>>>>>>> also create new host keys. It uses host keys (private and
>>>>>>>>>>>> public), so it works for all users.
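One practical consequence of the hostname churn described above: every fresh launch triggers the "authenticity of host" question again, and under `BatchMode yes` ssh cannot ask it interactively, so a non-interactive launch (like mpirun's) simply fails. A hedged sketch of a per-pattern workaround follows; the host patterns `domU-*`/`ip-10-*` and the key path `~/.ssh/tsakai` are illustrative, and skipping host-key checking trades away man-in-the-middle protection, so it is only defensible on trusted private interfaces.

```shell
# Sketch: generate a candidate ssh config fragment for EC2-internal hosts
# whose names (and host keys) change on every launch. Review it, then
# merge it into ~/.ssh/config by hand.
cat > ssh_config.ec2 <<'EOF'
Host domU-* ip-10-*
    IdentityFile ~/.ssh/tsakai
    IdentitiesOnly yes
    BatchMode yes
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF

cat ssh_config.ec2
```

With `StrictHostKeyChecking no` and a throwaway known-hosts file, the first-connect prompt never appears, so batch-mode ssh from orted launches is not blocked by it.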
>>>>>>>>>>>> Just for reference:
>>>>>>>>>>>>
>>>>>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>>>>>>
>>>>>>>>>>>> You could look into it later.
>>>>>>>>>>>>
>>>>>>>>>>>> ==
>>>>>>>>>>>>
>>>>>>>>>>>> - Can you try to use a command when connecting from A to B?
>>>>>>>>>>>>   E.g. ssh domU-12-31-39-06-74-E2 ls. Is this working too?
>>>>>>>>>>>>
>>>>>>>>>>>> - What about putting:
>>>>>>>>>>>>
>>>>>>>>>>>>       LogLevel DEBUG3
>>>>>>>>>>>>
>>>>>>>>>>>>   in your ~/.ssh/config? Maybe we can see in verbose mode what
>>>>>>>>>>>>   it's trying to negotiate before it fails.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Tena
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> your local machine is Linux-like, but the execution hosts are
>>>>>>>>>>>> Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>
>>>>>>>>>>>> a) executing a command on them is also working, e.g.:
>>>>>>>>>>>>    ssh domU-12-31-39-07-35-21 ls
>>>>>>>>>>>>
>>>>>>>>>>>> On 10.02.2011 at 07:08, Tena Sakai wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have made a bit of progress(?)... I made a config file in my
>>>>>>>>>>>> .ssh directory on the cloud. It looks like:
>>>>>>>>>>>>
>>>>>>>>>>>>     # machine A
>>>>>>>>>>>>     Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>>>>>
>>>>>>>>>>>> This is just an abbreviation or nickname above. To use the
>>>>>>>>>>>> specified settings, it's necessary to specify exactly this name.
>>>>>>>>>>>> When the settings are the same anyway for all machines, you can use:
>>>>>>>>>>>>
>>>>>>>>>>>>     Host *
>>>>>>>>>>>>         IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>         IdentitiesOnly yes
>>>>>>>>>>>>         BatchMode yes
>>>>>>>>>>>>
>>>>>>>>>>>> instead.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It
>>>>>>>>>>>> would also be an option to use hostbased authentication, which
>>>>>>>>>>>> will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>     HostName domU-12-31-39-07-35-21
>>>>>>>>>>>>     BatchMode yes
>>>>>>>>>>>>     IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>     ChallengeResponseAuthentication no
>>>>>>>>>>>>     IdentitiesOnly yes
>>>>>>>>>>>>
>>>>>>>>>>>>     # machine B
>>>>>>>>>>>>     Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>>>>     HostName domU-12-31-39-06-74-E2
>>>>>>>>>>>>     BatchMode yes
>>>>>>>>>>>>     IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>     ChallengeResponseAuthentication no
>>>>>>>>>>>>     IdentitiesOnly yes
>>>>>>>>>>>>
>>>>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>>>>
>>>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>>>
>>>>>>>>>>>> it hangs. I control-C out of it and I get:
>>>>>>>>>>>>
>>>>>>>>>>>>     mpirun: killing job...
>>>>>>>>>>>>     --------------------------------------------------------------------------
>>>>>>>>>>>>     mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>>>     that caused that situation.
>>>>>>>>>>>>     --------------------------------------------------------------------------
>>>>>>>>>>>>     mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>>>>>>     below. Additional manual cleanup may be required - please refer to
>>>>>>>>>>>>     the "orte-clean" tool for assistance.
>>>>>>>>>>>>     --------------------------------------------------------------------------
>>>>>>>>>>>>         domU-12-31-39-07-35-21.compute-1.internal - daemon did not
>>>>>>>>>>>>         report back when launched
>>>>>>>>>>>>
>>>>>>>>>>>> Am I making progress? Does this mean I am past authentication
>>>>>>>>>>>> and something else is the problem?
>>>>>>>>>>>>
>>>>>>>>>>>> Does someone have an example .ssh/config file I can look at?
>>>>>>>>>>>> There are so many keyword-argument pairs for this config file,
>>>>>>>>>>>> and I would like to look at some very basic one that works.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>> tsa...@gallo.ucsf.edu
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@vixen local]$ cat app.ac1
>>>>>>>>>>>>     -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>>>     -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>>>     -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>>>     -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>>
>>>>>>>>>>>> The program I run is
>>>>>>>>>>>>     Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>>> where x is [5..8]. The machines vixen and blitzen each run 2 jobs.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@vixen local]$ cat fib.R
>>>>>>>>>>>>     # fib() computes, given index n, the nth Fibonacci number iteratively.
>>>>>>>>>>>>     # The first dozen values (n = 1..12) are:
>>>>>>>>>>>>     # 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
>>>>>>>>>>>>
>>>>>>>>>>>>     fib <- function( n ) {
>>>>>>>>>>>>         a <- 0
>>>>>>>>>>>>         b <- 1
>>>>>>>>>>>>         for ( i in 1:n ) {
>>>>>>>>>>>>             t <- b
>>>>>>>>>>>>             b <- a
>>>>>>>>>>>>             a <- a + t
>>>>>>>>>>>>         }
>>>>>>>>>>>>         a
>>>>>>>>>>>>     }
>>>>>>>>>>>>
>>>>>>>>>>>>     arg <- commandArgs( TRUE )
>>>>>>>>>>>>     myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>>>     cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>>
>>>>>>>>>>>> It reads an argument from the command line and prints the
>>>>>>>>>>>> Fibonacci number at that index, followed by the machine name.
>>>>>>>>>>>> Pretty simple stuff.
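As a sanity check, the same iterative algorithm can be written as a short bash function. This is an editor's sketch, not part of the original post, but it reproduces the values the R script prints for the arguments in app.ac1 (5 -> 5, 6 -> 8, 7 -> 13, 8 -> 21):

```shell
# Iterative Fibonacci, mirroring fib.R: loop n times, carrying the pair
# (a, b) forward, and print a at the end.
fib() {
    local n=$1 a=0 b=1 t i
    for (( i = 0; i < n; i++ )); do
        t=$b
        b=$a
        a=$(( a + t ))
    done
    echo "$a"
}

fib 5   # prints 5
fib 8   # prints 21
```

Running it for 5..8 gives 5, 8, 13, 21, matching the mpirun output quoted in the thread.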
>>>>>>>>>>>> Here's the run output:
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@vixen local]$ mpirun -app app.ac1
>>>>>>>>>>>>     5 vixen.egcrc.org
>>>>>>>>>>>>     8 vixen.egcrc.org
>>>>>>>>>>>>     13 blitzen.egcrc.org
>>>>>>>>>>>>     21 blitzen.egcrc.org
>>>>>>>>>>>>
>>>>>>>>>>>> which is exactly what I expect. So far so good.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I want to run the same thing on the cloud. I launch 2
>>>>>>>>>>>> instances of the same virtual machine, which I get to by:
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@vixen local]$ ssh -i ~/.ssh/tsakai machine-instance-A-public-dns
>>>>>>>>>>>>
>>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ # password authentication,
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>     domU-12-31-39-00-D1-F2
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai domU-12-31-39-0C-C8-01
>>>>>>>>>>>>     Last login: Wed Feb  9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>>>     domU-12-31-39-0C-C8-01
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$ # without using a password
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai domU-12-31-39-00-D1-F2
>>>>>>>>>>>>     The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' can't be established.
>>>>>>>>>>>>     RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>     Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>>>     Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the list of known hosts.
>>>>>>>>>>>>     Last login: Wed Feb  9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>     domU-12-31-39-00-D1-F2
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>>>     logout
>>>>>>>>>>>>     Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>>>     logout
>>>>>>>>>>>>     Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>     domU-12-31-39-00-D1-F2
>>>>>>>>>>>>
>>>>>>>>>>>> As you can see, neither machine uses a password for
>>>>>>>>>>>> authentication; they use public/private key pairs. There is no
>>>>>>>>>>>> problem (that I can see) with ssh invocation from one machine to
>>>>>>>>>>>> the other. This is so because I have a copy of the public key
>>>>>>>>>>>> and a copy of the private key on each instance.
>>>>>>>>>>>> The app.ac file is identical, except for the node names:
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>>>     -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>>>     -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>>>     -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>>>     -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>>>
>>>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>>>     tsakai@domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>>>     Permission denied, please try again.
>>>>>>>>>>>>     tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>>     --------------------------------------------------------------------------
>>>>>>>>>>>>     mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>>>     that caused that situation.
>>>>>>>>>>>>     --------------------------------------------------------------------------
>>>>>>>>>>>>     mpirun: clean termination accomplished
>>>>>>>>>>>>
>>>>>>>>>>>>     [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>
>>>>>>>>>>>> mpirun (or somebody else?) asks me for a password, which I don't
>>>>>>>>>>>> have. I end up typing control-C.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's my question: how can I get past authentication by mpirun
>>>>>>>>>>>> when there is no password?
>>>>>>>>>>>>
>>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
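Before invoking mpirun, it can help to prove that every host named in the app file accepts non-interactive ssh, since a password prompt like the one above means key authentication failed for at least one host. Below is a sketch: `hosts_from_appfile` is a hypothetical helper (not an Open MPI tool) that pulls each `-H <host>` out of an app file, and the ssh loop is left commented out because it needs the live cluster.

```shell
# Sketch: list the unique hosts referenced by "-H <host>" entries in an
# Open MPI application context file such as app.ac1.
hosts_from_appfile() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "-H") print $(i+1) }' "$1" | sort -u
}

# Then each host could be probed with batch-mode ssh, e.g.:
# for h in $(hosts_from_appfile app.ac1); do
#     ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true \
#         && echo "$h OK" || echo "$h FAILED"
# done
```

If any host prints FAILED, mpirun run against that app file would hit the same interactive password prompt (or hang), so fixing the ssh side first narrows the problem.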
>>>>>>>>>>>>
>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>> tsa...@gallo.ucsf.edu

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users