Hi Tena

I hope somebody more knowledgeable about ssh
takes a look at the debug3 session log that you included.

I can't see if/where/why ssh is failing for you in EC2.

See other answers inline, please.

Tena Sakai wrote:
Hi Gus,

Thank you again for your reply.

A slight difference is that on vixen and dashen you ran the
MPI hostname tests as a regular user, not as root, right?
Not sure if this will make much of a difference,
but it may be worth trying to run it as a regular user in EC2 also.
In general, most people avoid running user applications (MPI programs
included) as root.
Mostly for safety, but I also wonder whether 'rootly powers' have any
implications for the under-the-hood processes that OpenMPI
launches along with the actual user programs.

Yes, between vixen and dasher I was doing the test as user tsakai,
not as root.  But the reason I wanted to do this test as root is
to show that it fails as a regular user (generating a 'pipe system
call failed' error), whereas as root it would succeed, as it did
on Friday.

Sorry again.
I even wrote "root can and Tena cannot", then I forgot.
Too many tasks at the same time, too much context-switching ...

The AMI has not changed; the last change to it was made last Tuesday.
So I don't understand this inconsistent behavior.  I have lots of
notes from previous sessions, and I consulted several successful
session logs to replicate what I saw on Friday, but with no success.

Having spent days and gotten nowhere, I decided to take a
different approach.  I instantiated a Linux AMI built by Amazon,
which feels CentOS/Fedora-based.  I downloaded gcc and c++, plus
OpenMPI 1.4.3.  After I got OpenMPI running, I created an account
for user tsakai, uploaded my public key, logged back in as user
tsakai, and ran the same test.  Surprisingly (or not?), it generated
the same result: I cannot run the same mpirun command when a remote
instance is involved, but mpirun runs fine on the instance by itself.
So I am feeling that this has to be an ssh authentication problem.
I looked at the man pages for ssh and ssh_config and cannot figure
out what I am doing wrong.  I put in a "LogLevel DEBUG3" line and it
generated lots of lines, among which I found this line:
  debug1: Authentication succeeded (publickey).
Then I see a bunch of lines that look like:
  debug3: Ignored env XXXXXXX
and mpirun hangs.  Here is the session log:


Ssh on our clusters uses host-based authentication.
I think Reuti sent you his page about it:
http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html

However, I believe OpenMPI shouldn't care which ssh authentication
mechanism is used, as long as it works passwordless.
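
In case you ever want to try host-based authentication on EC2 too,
here is a minimal sketch of what that howto sets up (per the page
above; hostnames and paths are placeholders, so please check the
page itself for the full details):

  # server side, /etc/ssh/sshd_config on each node:
  HostbasedAuthentication yes

  # client side, /etc/ssh/ssh_config on each node:
  HostbasedAuthentication yes
  EnableSSHKeysign yes

  # /etc/ssh/shosts.equiv lists the hostnames that are allowed in,
  # and /etc/ssh/ssh_known_hosts holds their public host keys.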

As for ssh configuration, ours is pretty standard
(a consolidated sketch follows the list below):

1) We don't have 'IdentitiesOnly yes' (the default is 'no'),
but use the standard identity file names, id_rsa, etc.
I think you are just telling ssh to use the specific identity
file you named.
I don't know whether this could cause the problem, but who knows?

2) We don't have 'BatchMode yes' set.

3) We have GSSAPI authentication enabled:

GSSAPIAuthentication yes

4) The locale environment variables are also passed
(may not be crucial):

        SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
        SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
        SendEnv LC_IDENTIFICATION LC_ALL

5) And X forwarding (you're not doing any X stuff, I suppose):

ForwardX11Trusted yes

6) However, you may want to check what is in your
/etc/ssh/ssh_config and /etc/ssh/sshd_config,
because some options may already be set there.

7) Take a look at 'man ssh[d]' and  'man ssh[d]_config' too.
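
Put together, a client-side ssh_config along the lines of ours would
look something like the sketch below (just the options from items 1-5,
not a drop-in file; adjust to your setup):

  Host *
        GSSAPIAuthentication yes
        ForwardX11Trusted yes
        SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
        SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
        SendEnv LC_IDENTIFICATION LC_ALL
        # note: no IdentityFile/IdentitiesOnly/BatchMode lines;
        # the default identity files (id_rsa, etc.) are used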

***

Finally, if you are willing, it may be worth running the same
experiment (with debug3) on vixen and dashen, just to compare the
verbose ssh messages there with what you see in EC2.
Perhaps that will help nail down the reason for the failure.
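
Also, besides the ssh-side debug3, you can ask mpirun itself to be
verbose about the daemons it launches, which sometimes says more
about "daemon did not report back" than ssh does.  From memory (so
please double-check against 'mpirun --help' on your 1.4.3 install),
something along these lines:

  mpirun --debug-daemons --leave-session-attached \
         --mca plm_base_verbose 5 -app app.ac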

Gus Correa



  [tsakai@vixen ec2]$
  [tsakai@vixen ec2]$ ssh -i $MYKEY
tsa...@ec2-50-17-24-195.compute-1.amazonaws.com
  Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # show firewall is off
  [tsakai@domU-12-31-39-16-75-1E ~]$ service iptables status
  -bash: service: command not found
  [tsakai@domU-12-31-39-16-75-1E ~]$ sudo service iptables status
  iptables: Firewall is not running.
  [tsakai@domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no
password authentication
  [tsakai@domU-12-31-39-16-75-1E ~]$ ssh
domU-12-31-39-16-4E-4C.compute-1.internal
  Last login: Wed Feb 16 06:53:14 2011 from
domu-12-31-39-16-75-1e.compute-1.internal

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
  [tsakai@domU-12-31-39-16-4E-4C ~]$
  [tsakai@domU-12-31-39-16-4E-4C ~]$ # also back to inst A
  [tsakai@domU-12-31-39-16-4E-4C ~]$
  [tsakai@domU-12-31-39-16-4E-4C ~]$ ssh
domU-12-31-39-16-75-1E.compute-1.internal
  Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # OK
  [tsakai@domU-12-31-39-16-75-1E ~]$ # back to inst B
  [tsakai@domU-12-31-39-16-75-1E ~]$ exit
  logout
  Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
  [tsakai@domU-12-31-39-16-4E-4C ~]$
  [tsakai@domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
  LD_LIBRARY_PATH=:/usr/local/lib
  [tsakai@domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
  [tsakai@domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
  iptables: Firewall is not running.
  [tsakai@domU-12-31-39-16-4E-4C ~]$
  [tsakai@domU-12-31-39-16-4E-4C ~]$ # go back to inst A
  [tsakai@domU-12-31-39-16-4E-4C ~]$ exit
  logout
  Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
  LD_LIBRARY_PATH=:/usr/local/lib
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac
  -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
  -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
  -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
  -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine);
bottom 2 are remote inst (inst B)
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
  ^Cmpirun: killing job...

  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to
  the "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
        domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
back when launched
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when
launched ***
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac2
  -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
  -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
  [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
  domU-12-31-39-16-75-1E
  domU-12-31-39-16-75-1E
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # that's no problem
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ cd .ssh
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
  Host *
        IdentityFile /home/tsakai/.ssh/tsakai
        IdentitiesOnly yes
        BatchMode yes
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
  -rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
  Host *
        IdentityFile /home/tsakai/.ssh/tsakai
        IdentitiesOnly yes
        BatchMode yes
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat - >> config
        LogLevel DEBUG3
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
  Host *
        IdentityFile /home/tsakai/.ssh/tsakai
        IdentitiesOnly yes
        BatchMode yes
        LogLevel DEBUG3
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
  -rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
  [tsakai@domU-12-31-39-16-75-1E .ssh]$
  [tsakai@domU-12-31-39-16-75-1E .ssh]$ cd ..
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
  debug2: ssh_connect: needpriv 0
  debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal
[10.96.77.182] port 22.
  debug1: Connection established.
  debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
  debug2: key_type_from_name: unknown key type '-----BEGIN'
  debug3: key_read: missing keytype
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug3: key_read: missing whitespace
  debug2: key_type_from_name: unknown key type '-----END'
  debug3: key_read: missing keytype
  debug1: identity file /home/tsakai/.ssh/tsakai type -1
  debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
  debug1: match: OpenSSH_5.3 pat OpenSSH*
  debug1: Enabling compatibility mode for protocol 2.0
  debug1: Local version string SSH-2.0-OpenSSH_5.3
  debug2: fd 3 setting O_NONBLOCK
  debug1: SSH2_MSG_KEXINIT sent
  debug3: Wrote 792 bytes for a total of 813
  debug1: SSH2_MSG_KEXINIT received
  debug2: kex_parse_kexinit:
diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diff
ie-hellman-group14-sha1,diffie-hellman-group1-sha1
  debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
  debug2: kex_parse_kexinit:
aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.l
iu.se
  debug2: kex_parse_kexinit:
aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.l
iu.se
  debug2: kex_parse_kexinit:
hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh
.com,hmac-sha1-96,hmac-md5-96
  debug2: kex_parse_kexinit:
hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh
.com,hmac-sha1-96,hmac-md5-96
  debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
  debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
  debug2: kex_parse_kexinit:
  debug2: kex_parse_kexinit:
  debug2: kex_parse_kexinit: first_kex_follows 0
  debug2: kex_parse_kexinit: reserved 0
  debug2: kex_parse_kexinit:
diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diff
ie-hellman-group14-sha1,diffie-hellman-group1-sha1
  debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
  debug2: kex_parse_kexinit:
aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.l
iu.se
  debug2: kex_parse_kexinit:
aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.l
iu.se
  debug2: kex_parse_kexinit:
hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh
.com,hmac-sha1-96,hmac-md5-96
  debug2: kex_parse_kexinit:
hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh
.com,hmac-sha1-96,hmac-md5-96
  debug2: kex_parse_kexinit: none,z...@openssh.com
  debug2: kex_parse_kexinit: none,z...@openssh.com
  debug2: kex_parse_kexinit:
  debug2: kex_parse_kexinit:
  debug2: kex_parse_kexinit: first_kex_follows 0
  debug2: kex_parse_kexinit: reserved 0
  debug2: mac_setup: found hmac-md5
  debug1: kex: server->client aes128-ctr hmac-md5 none
  debug2: mac_setup: found hmac-md5
  debug1: kex: client->server aes128-ctr hmac-md5 none
  debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
  debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
  debug3: Wrote 24 bytes for a total of 837
  debug2: dh_gen_key: priv key bits set: 125/256
  debug2: bits set: 489/1024
  debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
  debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
  debug3: Wrote 144 bytes for a total of 981
  debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
  debug3: check_host_in_hostfile: match line 1
  debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
  debug3: check_host_in_hostfile: match line 1
  debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and
matches the RSA host key.
  debug1: Found key in /home/tsakai/.ssh/known_hosts:1
  debug2: bits set: 491/1024
  debug1: ssh_rsa_verify: signature correct
  debug2: kex_derive_keys
  debug2: set_newkeys: mode 1
  debug1: SSH2_MSG_NEWKEYS sent
  debug1: expecting SSH2_MSG_NEWKEYS
  debug3: Wrote 16 bytes for a total of 997
  debug2: set_newkeys: mode 0
  debug1: SSH2_MSG_NEWKEYS received
  debug1: SSH2_MSG_SERVICE_REQUEST sent
  debug3: Wrote 48 bytes for a total of 1045
  debug2: service_accept: ssh-userauth
  debug1: SSH2_MSG_SERVICE_ACCEPT received
  debug2: key: /home/tsakai/.ssh/tsakai ((nil))
  debug3: Wrote 64 bytes for a total of 1109
  debug1: Authentications that can continue: publickey
  debug3: start over, passed a different list publickey
  debug3: preferred gssapi-with-mic,publickey
  debug3: authmethod_lookup publickey
  debug3: remaining preferred: ,publickey
  debug3: authmethod_is_enabled publickey
  debug1: Next authentication method: publickey
  debug1: Trying private key: /home/tsakai/.ssh/tsakai
  debug1: read PEM private key done: type RSA
  debug3: sign_and_send_pubkey
  debug2: we sent a publickey packet, wait for reply
  debug3: Wrote 384 bytes for a total of 1493
  debug1: Authentication succeeded (publickey).
  debug2: fd 4 setting O_NONBLOCK
  debug1: channel 0: new [client-session]
  debug3: ssh_session2_open: channel_new: 0
  debug2: channel 0: send open
  debug1: Requesting no-more-sessi...@openssh.com
  debug1: Entering interactive session.
  debug3: Wrote 128 bytes for a total of 1621
  debug2: callback start
  debug2: client_session2_setup: id 0
  debug1: Sending environment.
  debug3: Ignored env HOSTNAME
  debug3: Ignored env TERM
  debug3: Ignored env SHELL
  debug3: Ignored env HISTSIZE
  debug3: Ignored env EC2_AMITOOL_HOME
  debug3: Ignored env SSH_CLIENT
  debug3: Ignored env SSH_TTY
  debug3: Ignored env USER
  debug3: Ignored env LD_LIBRARY_PATH
  debug3: Ignored env LS_COLORS
  debug3: Ignored env EC2_HOME
  debug3: Ignored env MAIL
  debug3: Ignored env PATH
  debug3: Ignored env INPUTRC
  debug3: Ignored env PWD
  debug3: Ignored env JAVA_HOME
  debug1: Sending env LANG = en_US.UTF-8
  debug2: channel 0: request env confirm 0
  debug3: Ignored env AWS_CLOUDWATCH_HOME
  debug3: Ignored env AWS_IAM_HOME
  debug3: Ignored env SHLVL
  debug3: Ignored env HOME
  debug3: Ignored env AWS_PATH
  debug3: Ignored env AWS_AUTO_SCALING_HOME
  debug3: Ignored env LOGNAME
  debug3: Ignored env AWS_ELB_HOME
  debug3: Ignored env SSH_CONNECTION
  debug3: Ignored env LESSOPEN
  debug3: Ignored env AWS_RDS_HOME
  debug3: Ignored env G_BROKEN_FILENAMES
  debug3: Ignored env _
  debug3: Ignored env OLDPWD
  debug3: Ignored env OMPI_MCA_plm
  debug1: Sending command:  orted --daemonize -mca ess env -mca
orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
--hnp-uri "125566976.0;tcp://10.96.118.236:56064"
  debug2: channel 0: request exec confirm 1
  debug2: fd 3 setting TCP_NODELAY
  debug2: callback done
  debug2: channel 0: open confirm rwindow 0 rmax 32768
  debug3: Wrote 272 bytes for a total of 1893
  debug2: channel 0: rcvd adjust 2097152
  debug2: channel_input_status_confirm: type 99 id 0
  debug2: exec request accepted on channel 0
  debug2: channel 0: read<=0 rfd 4 len 0
  debug2: channel 0: read failed
  debug2: channel 0: close_read
  debug2: channel 0: input open -> drain
  debug2: channel 0: ibuf empty
  debug2: channel 0: send eof
  debug2: channel 0: input drain -> closed
  debug3: Wrote 32 bytes for a total of 1925
  debug2: channel 0: rcvd eof
  debug2: channel 0: output open -> drain
  debug2: channel 0: obuf empty
  debug2: channel 0: close_write
  debug2: channel 0: output drain -> closed
  debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
  debug2: channel 0: rcvd close
  debug3: channel 0: will not send data after close
  debug2: channel 0: almost dead
  debug2: channel 0: gc: notify user
  debug2: channel 0: gc: user detached
  debug2: channel 0: send close
  debug2: channel 0: is dead
  debug2: channel 0: garbage collecting
  debug1: channel 0: free: client-session, nchannels 1
  debug3: channel 0: status: The following connections are open:
    #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)

  debug3: channel 0: close_fds r -1 w -1 e 6 c -1
  debug3: Wrote 32 bytes for a total of 1957
  debug3: Wrote 64 bytes for a total of 2021
  debug1: fd 0 clearing O_NONBLOCK
  Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
  Bytes per second: sent 18384.8, received 18944.3
  debug1: Exit status 0
  # it is hanging; I am about to issue control-C
  ^Cmpirun: killing job...

  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to
  the "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
        domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
back when launched
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
  [tsakai@domU-12-31-39-16-75-1E ~]$ # daemon did not report back when
launched
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # what does that mean?
  [tsakai@domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ # I give up
  [tsakai@domU-12-31-39-16-75-1E ~]$
  [tsakai@domU-12-31-39-16-75-1E ~]$ exit
  logout
  [tsakai@vixen ec2]$
  [tsakai@vixen ec2]$

Do you see anything strange?

One final question: the ssh man page mentions a few environment
variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc.  Do
any of these matter as far as OpenMPI is concerned?

Thank you, Gus.

Regards,

Tena

On 2/15/11 5:09 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

Tena Sakai wrote:
Hi,

I am trying to reproduce what I was able to show last Friday on Amazon
EC2 instances, but I am having a problem.  What I was able to show last
Friday as root was with this command:
  mpirun -app app.ac
with app.ac being:
  -H dns-entry-A -np 1 (linux command)
  -H dns-entry-A -np 1 (linux command)
  -H dns-entry-B -np 1 (linux command)
  -H dns-entry-B -np 1 (linux command)

Here's the config file in root's .ssh directory:
  Host *
        IdentityFile /root/.ssh/.derobee/.kagi
        IdentitiesOnly yes
        BatchMode yes

Yesterday and today I can't get this to work.  I made the last part
of the app.ac file simpler (it now says /bin/hostname).  Below is
the session:

  -bash-3.2#
  -bash-3.2# # I am on instance A, host name for inst A is:
  -bash-3.2# hostname
  domU-12-31-39-09-CD-C2
  -bash-3.2#
  -bash-3.2# nslookup domU-12-31-39-09-CD-C2
  Server:               172.16.0.23
  Address:      172.16.0.23#53

  Non-authoritative answer:
  Name: domU-12-31-39-09-CD-C2.compute-1.internal
  Address: 10.210.210.48

  -bash-3.2# cd .ssh
  -bash-3.2#
  -bash-3.2# cat config
  Host *
          IdentityFile /root/.ssh/.derobee/.kagi
          IdentitiesOnly yes
          BatchMode yes
  -bash-3.2#
  -bash-3.2# ll config
  -rw-r--r-- 1 root root 103 Feb 15 17:18 config
  -bash-3.2#
  -bash-3.2# chmod 600 config
  -bash-3.2#
  -bash-3.2# # show I can go to inst B without password/passphrase
  -bash-3.2#
  -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
  Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
  -bash-3.2#
  -bash-3.2# hostname
  domU-12-31-39-09-E6-71
  -bash-3.2#
  -bash-3.2# nslookup `hostname`
  Server:               172.16.0.23
  Address:      172.16.0.23#53

  Non-authoritative answer:
  Name: domU-12-31-39-09-E6-71.compute-1.internal
  Address: 10.210.233.123

  -bash-3.2# # and back to inst A is also no problem
  -bash-3.2#
  -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
  Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
  -bash-3.2#
  -bash-3.2# hostname
  domU-12-31-39-09-CD-C2
  -bash-3.2#
  -bash-3.2# # log out twice to go back to inst A
  -bash-3.2# exit
  logout
  Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
  -bash-3.2#
  -bash-3.2# exit
  logout
  Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
  -bash-3.2#
  -bash-3.2# hostname
  domU-12-31-39-09-CD-C2
  -bash-3.2#
  -bash-3.2# cd ..
  -bash-3.2#
  -bash-3.2# pwd
  /root
  -bash-3.2#
  -bash-3.2# ll
  total 8
  -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
  -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
  -bash-3.2#
  -bash-3.2# cat app.ac
  -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
  -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
  -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
  -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
  -bash-3.2#
  -bash-3.2# # when there is a remote machine (bottome 2 lines) it hangs
  -bash-3.2# mpirun -app app.ac
  mpirun: killing job...

  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to
  the "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
        domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
report back when launched
  -bash-3.2#
  -bash-3.2# cat app.ac2
  -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
  -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
  -bash-3.2#
  -bash-3.2# # when there is no remote machine, then mpirun works:
  -bash-3.2# mpirun -app app.ac2
  domU-12-31-39-09-CD-C2
  domU-12-31-39-09-CD-C2
  -bash-3.2#
  -bash-3.2# hostname
  domU-12-31-39-09-CD-C2
  -bash-3.2#
  -bash-3.2# # this gotta be ssh problem....
  -bash-3.2#
  -bash-3.2# # show no firewall is used
  -bash-3.2# iptables --list
  Chain INPUT (policy ACCEPT)
   target     prot opt source               destination

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination
  -bash-3.2#
  -bash-3.2# exit
  logout
  [tsakai@vixen ec2]$

Would someone please point out what I am doing wrong?

Thank you.

Regards,

Tena

Hi Tena

Nothing wrong that I can see.
Just another couple of suggestions,
based on somewhat vague possibilities.

A slight difference is that on vixen and dashen you ran the
MPI hostname tests as a regular user, not as root, right?
Not sure if this will make much of a difference,
but it may be worth trying to run it as a regular user in EC2 also.
In general, most people avoid running user applications (MPI programs
included) as root.
Mostly for safety, but I also wonder whether 'rootly powers' have any
implications for the under-the-hood processes that OpenMPI
launches along with the actual user programs.

This may make no difference either,
but you could do a 'service iptables status',
to see if the service is running, even though there are
no explicit iptables rules (as per your email).
If the service is not running you get
'Firewall is stopped.' (in CentOS).
I *think* 'iptables --list' loads the iptables module into the
kernel, as a side effect, whereas the service command does not.
So, it may be cleaner (safer?) to use the service version
instead of 'iptables --list'.
I don't know if it will make any difference,
but just in case, if the service is running,
why not do 'service iptables stop',
and perhaps also 'chkconfig iptables off' to be completely
free of iptables?
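
In shell form, the checks above would be roughly (run as root, or
with sudo in EC2):

  service iptables status    # is the firewall service running?
  service iptables stop      # stop it for the current session
  chkconfig iptables off     # keep it off across reboots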

Gus Correa
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


