Hi Jeff,

I have chosen to call this thread "This must be ssh problem, but I can't figure out what it is..." It turns out that's really wrong!
EC2 allows users to create what is called a security group. A security group is made up of one or more security rules, each of which is basically a port-based firewall rule specification. (Networking is not my forte and I may be using the wrong terminology, but I am trying to convey the concept.)

I had created a security group "intra." I opened the ssh port range from 0 to 65535 and launched instances, two at a time, in the same availability zone, each belonging to the group intra. So, here, ssh is a security rule of the security group intra. One of the fields of each rule is "source." I had tried different settings for the source field, but what I had been failing to do was to set that field to the name of the group itself, namely intra. With the source set to intra, each instance that belongs to the group can reach the others.
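For reference, roughly the same setup expressed with the AWS command line tools would look something like the sketch below. (This is an approximation using the current "aws ec2" CLI rather than the exact steps I took in the console; the group name intra is the one above.)

$ aws ec2 create-security-group --group-name intra --description "intra-cluster traffic"
$ aws ec2 authorize-security-group-ingress --group-name intra \
      --protocol tcp --port 0-65535 --source-group intra
(the second command is the key part: it opens all TCP ports with the group
itself -- intra -- as the source, so every instance launched into the group
can reach every other member)

With that rule in place, the instances launched into intra can all talk to each other over TCP, which is what mpirun and orted need.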
I have not exhausted all the tests I have in mind, but so far it looks promising. I will expand my testing to a wider set tomorrow.

Many thanks for your guidance all along. In a week or two, I look forward to putting together a mini "how-to openMPI on cloud".

Regards,
Tena

On 2/17/11 6:52 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

> On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:
>
>> For now, may I point out something I noticed out of the
>> DEBUG3 output last night?
>>
>> I found this line:
>>
>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>
> What this means is that ssh sent the "orted ..." command to the remote side.
>
> As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" --
> it's a helper thingy that mpirun launches on the remote nodes before
> launching your actual application. All those parameters (from --daemonize
> through "...:56064") are options for orted.
>
> All of that gorp is considered internal to Open MPI -- most people never
> see that stuff.
>
>> Followed by:
>>
>>> debug2: channel 0: request exec confirm 1
>>> debug2: fd 3 setting TCP_NODELAY
>>> debug2: callback done
>>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>>> debug3: Wrote 272 bytes for a total of 1893
>>> debug2: channel 0: rcvd adjust 2097152
>>> debug2: channel_input_status_confirm: type 99 id 0
>
> This is just more status information about the ssh connection; it doesn't
> really have any direct relation to Open MPI.
>
> I don't know offhand whether ssh displays the ack that a command
> successfully ran. If you're not convinced that it did, then log in to the
> other node while the command is hung and run a ps to see whether the orted
> is actually running or not. I *suspect* that it is running, but that it's
> just hung for some reason.
>
> -----
>
> Here are some suggestions to try for debugging:
>
> On your new Linux AMI instances (some of this may be redundant with what
> you did already):
>
> - ensure that firewalling is disabled on all instances
>
> - ensure that your .bashrc (or whatever startup file is relevant to your
> shell) is set to prefix PATH and LD_LIBRARY_PATH with your Open MPI
> installation. Ensure that you *prefix* these variables to guarantee that
> you don't get interference from already-installed versions of Open MPI
> (e.g., if Open MPI is installed by default on your AMI and you weren't
> aware of it)
>
> - set up a simple, per-user SSH key, perhaps something like this:
>
> A$ rm -rf $HOME/.ssh
> (remove what you had before; let's just start over)
>
> A$ ssh-keygen -t dsa
> (hit enter to accept all defaults and set no passphrase)
>
> A$ cd $HOME/.ssh
> A$ cp id_dsa.pub authorized_keys
> A$ chmod 644 authorized_keys
> A$ ssh othernode
> (login to node B)
>
> B$ ssh-keygen -t dsa
> (hit enter to accept all defaults and set no passphrase; just to create
> $HOME/.ssh with the right permissions, etc.)
>
> B$ scp @firstnode:.ssh/id_dsa\* .
> (enter your password on A -- we're overwriting all the files here)
>
> B$ cp id_dsa.pub authorized_keys
> B$ chmod 644 authorized_keys
>
> Now you should be able to ssh from one node to the other without passwords:
>
> A$ ssh othernode hostname
> B
> A$
>
> and
>
> B$ ssh firstnode hostname
> A
> B$
>
> Don't just test with "ssh othernode" -- test with "ssh othernode <command>"
> to ensure that non-interactive logins work properly. That's what Open MPI
> will use under the covers.
>
> - Now ensure that PATH and LD_LIBRARY_PATH are set for non-interactive ssh
> sessions (i.e., some .bashrc's will exit "early" if they detect that it is
> a non-interactive session). For example:
>
> A$ ssh othernode env | grep -i path
>
> Ensure that the output shows the PATH and LD_LIBRARY_PATH locations for
> Open MPI at the beginning of those variables. To go for the gold, you can
> try this, too:
>
> A$ ssh othernode which ompi_info
> (if all paths are set right, this should show the ompi_info of your 1.4.3
> install)
> A$ ssh othernode ompi_info
> (should show all the info about your 1.4.3 install)
>
> - If all of the above works, then test with a simple, non-MPI application
> across both nodes:
>
> A$ mpirun --host firstnode,othernode -np 2 hostname
> A
> B
> A$
>
> - When that works, you should be able to test with a simple MPI application
> (e.g., the examples/ring_c.c file in the Open MPI distribution):
>
> A$ cd /path/to/open/mpi/source
> A$ cd examples
> A$ make
> ...
> A$ scp ring_c @othernode:/path/to/open/mpi/source/examples
> ...
> A$ mpirun --host firstnode,othernode -np 4 ring_c
>
> Make sense?
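P.S. For my own notes (and for the eventual mini how-to), the PATH / LD_LIBRARY_PATH item above boils down to something like the sketch below, assuming the 1.4.3 build lives under /opt/openmpi-1.4.3 (adjust to wherever it is actually installed). The two important details are that the Open MPI directories come first, and that the lines sit above any "return if not interactive" test in .bashrc, so that non-interactive ssh sessions -- which is what mpirun uses -- pick them up as well.

# near the top of ~/.bashrc, before any test like:  [ -z "$PS1" ] && return
export PATH=/opt/openmpi-1.4.3/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-1.4.3/lib:$LD_LIBRARY_PATH

With that in place, "ssh othernode env | grep -i path" should show both variables starting with the /opt/openmpi-1.4.3 directories.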