Hi Gus,

I am starting to see the light at the end of the tunnel.
As I wrote in reply to Jeff, it was not an ssh problem.  It was
a setting in the user-configurable firewall that Amazon calls
a security group.  I need to expand my small tests to a wider
set, but I think I can do that.  I will keep you posted in the
coming days/weeks.
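
Roughly, the fix amounts to letting the two instances reach each
other on the TCP ports OpenMPI uses (by default it picks ephemeral
ports, so the whole range has to be open between the instances).
With the EC2 API tools it would look something like the sketch
below, where the group name "mpi-test" is hypothetical and
<account-id> stands for the AWS account number:

  # allow the instances in group "mpi-test" to reach each other
  ec2-authorize mpi-test -o mpi-test -u <account-id>

  # keep port 22 open to the outside so regular ssh logins still work
  ec2-authorize mpi-test -P tcp -p 22 -s 0.0.0.0/0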

Many thanks for your help and dialog.  I really appreciate
your explanations.

Thank you!

Regards,

Tena


On 2/16/11 4:31 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Hi Tena
>
> Again, I think your EC2 session log with ssh debug3 level (below)
> should be looked at by somebody more knowledgeable in OpenMPI
> and in ssh than me.
> There must be some clue to what is going on there.
>
> Ssh experts, Jeff, Ralph, please help!
>
> Anyway ...
> AFAIK, 'orted' in the first line you selected/highlighted below
> is the 'OpenMPI Run-Time Environment daemon' ( ... the OpenMPI pros
> are authorized to send me to the galleys if it is not ...).
> So, orted is trying to do its thing, to create the conditions for your
> job to run across the two EC2 'instances'. (Gone are the naive
> days when these things were computers, each one in its own box ...)
> This master-of-ceremonies work of orted is done via tcp, and I guess
> 10.96.118.236 is the IP (of instance A, where mpirun runs)
> and 56064 is probably the port
> to which orted on instance B is supposed to connect back.
> The bunch of -mca parameters are just what they are: MCA parameters
> (MCA = Modular Component Architecture of OpenMPI, and here I risk
> being shanghaied or ridiculed again ...).
> (You can learn more about the MCA parameters with 'ompi_info -help'.)
> That is how, in my ignorance, I parse that line.
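>
> If the EC2 firewall turns out to be the culprit, the TCP ports
> OpenMPI uses can be pinned down with MCA parameters, so that only a
> known range needs to be opened.  A rough sketch (the parameter names
> are as I remember them from the 1.4 series, so please check them
> with ompi_info first):
>
>    # list the tunable TCP parameters for the MPI and out-of-band layers
>    ompi_info --param btl tcp
>    ompi_info --param oob tcp
>
>    # e.g., confine MPI traffic to ports 10000-10099
>    mpirun --mca btl_tcp_port_min_v4 10000 \
>           --mca btl_tcp_port_range_v4 100 -app app.ac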
>
> So, from the computer/instance-A side orted gives the first kick,
> but somehow the ball never comes back from computer/instance-B.
> It's ping- without -pong.
> The same frustrating feeling I had when I was a kid and kicked the
> soccer ball on the neighbor's side and would never see it again.
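>
> One way to see where the ball gets lost is to keep the remote
> daemons' output attached to your terminal.  Something along these
> lines (flags as I remember them from the 1.3/1.4 series):
>
>    mpirun --debug-daemons --leave-session-attached \
>           --mca plm_base_verbose 5 -app app.ac
>
> If the orted on instance B never calls back, this usually says so
> out loud instead of hanging silently.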
>
> Cheers,
> Gus
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>> Thank you for your reply and suggestions.
>>
>> I will follow up on these in a bit and will give you an
>> update.  Looking at what vixen and/or dasher generates
>> from DEBUG3 would be interesting.
>>
>> For now, may I point out something I noticed in the
>> DEBUG3 output last night?
>>
>> I found this line:
>>
>>>   debug1: Sending command:  orted --daemonize -mca ess env -mca
>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>
>> Followed by:
>>
>>>   debug2: channel 0: request exec confirm 1
>>>   debug2: fd 3 setting TCP_NODELAY
>>>   debug2: callback done
>>>   debug2: channel 0: open confirm rwindow 0 rmax 32768
>>>   debug3: Wrote 272 bytes for a total of 1893
>>>   debug2: channel 0: rcvd adjust 2097152
>>>   debug2: channel_input_status_confirm: type 99 id 0
>>
>> It appears, to my untrained eye/mind, that a directive from instance A
>> to B was issued.  And then what happened?  I don't see that it was
>> honored by instance B.
>>
>> Can you please comment on this?
>>
>> Thank you.
>>
>> Regards,
>>
>> Tena
>>
>> On 2/16/11 1:34 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>
>>> Hi Tena
>>>
>>> I hope somebody more knowledgeable in ssh
>>> takes a look at the debug3 session log that you included.
>>>
>>> I can't see if/where/why ssh is failing for you in EC2.
>>>
>>> See other answers inline, please.
>>>
>>> Tena Sakai wrote:
>>>> Hi Gus,
>>>>
>>>> Thank you again for your reply.
>>>>
>>>>> A slight difference is that on vixen and dasher you ran the
>>>>> MPI hostname tests as a regular user, not as root, right?
>>>>> Not sure if this will make much of a difference,
>>>>> but it may be worth trying to run it as a regular user in EC2 also.
>>>>> In general, most people avoid running user applications (MPI programs
>>>>> included) as root.
>>>>> Mostly for safety, but I wonder if there are any
>>>>> implications in the 'rootly powers'
>>>>> regarding the under-the-hood processes that OpenMPI
>>>>> launches along with the actual user programs.
>>>> Yes, between vixen and dasher I was doing the test as user tsakai,
>>>> not as root.  But the reason I wanted to do this test as root is
>>>> to show that it fails as a regular user (generating a "pipe system
>>>> call failed" error), whereas as root it would succeed, as it did
>>>> on Friday.
>>> Sorry again.
>>> I even wrote "root can and Tena cannot", then I forgot.
>>> Too many tasks at the same time, too much context-switching ...
>>>
>>>> The AMI has not changed; the last change to the AMI
>>>> was last Tuesday.  As such, I don't understand this inconsistent
>>>> behavior.  I have lots of notes from previous sessions and I
>>>> consulted various successful session logs trying to replicate what
>>>> I saw Friday, but with no success.
>>>>
>>>> Having spent days and not getting anywhere, I decided to take a
>>>> different approach.  I instantiated a Linux AMI built by Amazon,
>>>> which feels CentOS/Fedora-based.  I downloaded gcc and c++, plus
>>>> OpenMPI 1.4.3.  After I got OpenMPI running, I created an account
>>>> for user tsakai, uploaded my public key, re-logged in as user
>>>> tsakai, and ran the same test.  Surprisingly (or not?) it
>>>> generated the same result.  I.e., I cannot run the same mpirun
>>>> command when a remote instance is involved, but on a single
>>>> instance mpirun runs fine.  So, I am feeling that this has to be
>>>> an ssh authentication problem.  I looked at the man pages for ssh
>>>> and ssh_config and cannot figure out what I am doing wrong.  I put
>>>> in a "LogLevel DEBUG3" line and it generated lots of lines, in
>>>> which I found this line:
>>>>   debug1: Authentication succeeded (publickey).
>>>> Then I see a bunch of lines that look like:
>>>>   debug3: Ignored env XXXXXXX
>>>> and mpirun hangs.  Here is the session log:
>>>>
>>> Ssh on our clusters uses host-based authentication.
>>> I think Reuti sent you his page about it:
>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>
>>> However, I believe OpenMPI shouldn't care which ssh authentication
>>> mechanism is used, as long as it works passwordless.
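>>>
>>> A quick sanity check is to drive ssh the way mpirun does,
>>> non-interactively; for instance:
>>>
>>>    ssh -o BatchMode=yes -o ConnectTimeout=5 \
>>>        domU-12-31-39-16-4E-4C.compute-1.internal /bin/true && echo OK
>>>
>>> If that prints OK without prompting, the ssh side should be fine
>>> for OpenMPI's purposes.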
>>>
>>> As for ssh configuration, ours is pretty standard:
>>>
>>> 1) We don't have 'IdentitiesOnly yes' (default is 'no'),
>>> but use standard identity file names id_rsa, etc.
>>> I think you are just telling ssh to use the specific identity
>>> file you named.
>>> I don't know if this may cause the problem, but who knows?
>>>
>>> 2) We don't have 'BatchMode yes' set.
>>>
>>> 3) We have the GSS authentication set
>>>
>>> GSSAPIAuthentication yes
>>>
>>> 4) The locale environment variables are also passed
>>> (may not be crucial):
>>>
>>>         SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
>>>         SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
>>>         SendEnv LC_IDENTIFICATION LC_ALL
>>>
>>> 5) And X forwarding (you're not doing any X stuff, I suppose):
>>>
>>> ForwardX11Trusted yes
>>>
>>> 6) However, you may want to check what is in your
>>> /etc/ssh/ssh_config and /etc/ssh/sshd_config,
>>> because some options may be already set there.
>>>
>>> 7) Take a look at 'man ssh[d]' and  'man ssh[d]_config' too.
>>>
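>>> Putting items 3 through 5 together, the relevant part of such an
>>> ssh_config would look more or less like this (a sketch, not a
>>> verbatim copy of ours):
>>>
>>>    Host *
>>>        GSSAPIAuthentication yes
>>>        ForwardX11Trusted yes
>>>        SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
>>>        SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
>>>        SendEnv LC_IDENTIFICATION LC_ALL
>>>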
>>> ***
>>>
>>> Finally, if you are willing to, it may be worth running the same
>>> experiment (with debug3) on vixen and dasher, just to compare what
>>> comes out of the verbose ssh messages there with what you see in EC2.
>>> Perhaps it may help nail down the reason for the failure.
>>>
>>> Gus Correa
>>>
>>>
>>>
>>>>   [tsakai@vixen ec2]$
>>>>   [tsakai@vixen ec2]$ ssh -i $MYKEY
>>>> tsa...@ec2-50-17-24-195.compute-1.amazonaws.com
>>>>   Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1
>>>>
>>>>          __|  __|_  )  Amazon Linux AMI
>>>>          _|  (     /     Beta
>>>>         ___|\___|___|
>>>>
>>>>   See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>>> :-)
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # show firewall is off
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ service iptables status
>>>>   -bash: service: command not found
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ sudo service iptables status
>>>>   iptables: Firewall is not running.
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no
>>>> password authentication
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ ssh
>>>> domU-12-31-39-16-4E-4C.compute-1.internal
>>>>   Last login: Wed Feb 16 06:53:14 2011 from
>>>> domu-12-31-39-16-75-1e.compute-1.internal
>>>>
>>>>          __|  __|_  )  Amazon Linux AMI
>>>>          _|  (     /     Beta
>>>>         ___|\___|___|
>>>>
>>>>   See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>>> :-)
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ # also back to inst A
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ ssh
>>>> domU-12-31-39-16-75-1E.compute-1.internal
>>>>   Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1
>>>>
>>>>          __|  __|_  )  Amazon Linux AMI
>>>>          _|  (     /     Beta
>>>>         ___|\___|___|
>>>>
>>>>   See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>>> :-)
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # OK
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # back to inst B
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ exit
>>>>   logout
>>>>   Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
>>>>   LD_LIBRARY_PATH=:/usr/local/lib
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
>>>>   iptables: Firewall is not running.
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ # go back to inst A
>>>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ exit
>>>>   logout
>>>>   Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
>>>>   LD_LIBRARY_PATH=:/usr/local/lib
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac
>>>>   -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>>   -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>>   -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>>>   -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine);
>>>> bottom 2 are remote inst (inst B)
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>>>   ^Cmpirun: killing job...
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun noticed that the job aborted, but has no info as to the process
>>>>   that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>   below. Additional manual cleanup may be required - please refer to
>>>>   the "orte-clean" tool for assistance.
>>>>
>>>> --------------------------------------------------------------------------
>>>>         domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>>>> back when launched
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when
>>>> launched ***
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac2
>>>>   -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>>   -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
>>>>   domU-12-31-39-16-75-1E
>>>>   domU-12-31-39-16-75-1E
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # that's no problem
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ cd .ssh
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>>>   Host *
>>>>         IdentityFile /home/tsakai/.ssh/tsakai
>>>>         IdentitiesOnly yes
>>>>         BatchMode yes
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
>>>>   -rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>>>   Host *
>>>>         IdentityFile /home/tsakai/.ssh/tsakai
>>>>         IdentitiesOnly yes
>>>>         BatchMode yes
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat - >> config
>>>>         LogLevel DEBUG3
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>>>   Host *
>>>>         IdentityFile /home/tsakai/.ssh/tsakai
>>>>         IdentitiesOnly yes
>>>>         BatchMode yes
>>>>         LogLevel DEBUG3
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
>>>>   -rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cd ..
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>>>   debug2: ssh_connect: needpriv 0
>>>>   debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal
>>>> [10.96.77.182] port 22.
>>>>   debug1: Connection established.
>>>>   debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
>>>>   debug2: key_type_from_name: unknown key type '-----BEGIN'
>>>>   debug3: key_read: missing keytype
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug3: key_read: missing whitespace
>>>>   debug2: key_type_from_name: unknown key type '-----END'
>>>>   debug3: key_read: missing keytype
>>>>   debug1: identity file /home/tsakai/.ssh/tsakai type -1
>>>>   debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
>>>>   debug1: match: OpenSSH_5.3 pat OpenSSH*
>>>>   debug1: Enabling compatibility mode for protocol 2.0
>>>>   debug1: Local version string SSH-2.0-OpenSSH_5.3
>>>>   debug2: fd 3 setting O_NONBLOCK
>>>>   debug1: SSH2_MSG_KEXINIT sent
>>>>   debug3: Wrote 792 bytes for a total of 813
>>>>   debug1: SSH2_MSG_KEXINIT received
>>>>   debug2: kex_parse_kexinit:
>>>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>>>   debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>>>   debug2: kex_parse_kexinit:
>>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>>>>   debug2: kex_parse_kexinit:
>>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>>>>   debug2: kex_parse_kexinit:
>>>> hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>>>>   debug2: kex_parse_kexinit:
>>>> hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>>>>   debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
>>>>   debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
>>>>   debug2: kex_parse_kexinit:
>>>>   debug2: kex_parse_kexinit:
>>>>   debug2: kex_parse_kexinit: first_kex_follows 0
>>>>   debug2: kex_parse_kexinit: reserved 0
>>>>   debug2: kex_parse_kexinit:
>>>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>>>   debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>>>   debug2: kex_parse_kexinit:
>>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>>>>   debug2: kex_parse_kexinit:
>>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>>>>   debug2: kex_parse_kexinit:
>>>> hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>>>>   debug2: kex_parse_kexinit:
>>>> hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>>>>   debug2: kex_parse_kexinit: none,z...@openssh.com
>>>>   debug2: kex_parse_kexinit: none,z...@openssh.com
>>>>   debug2: kex_parse_kexinit:
>>>>   debug2: kex_parse_kexinit:
>>>>   debug2: kex_parse_kexinit: first_kex_follows 0
>>>>   debug2: kex_parse_kexinit: reserved 0
>>>>   debug2: mac_setup: found hmac-md5
>>>>   debug1: kex: server->client aes128-ctr hmac-md5 none
>>>>   debug2: mac_setup: found hmac-md5
>>>>   debug1: kex: client->server aes128-ctr hmac-md5 none
>>>>   debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
>>>>   debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
>>>>   debug3: Wrote 24 bytes for a total of 837
>>>>   debug2: dh_gen_key: priv key bits set: 125/256
>>>>   debug2: bits set: 489/1024
>>>>   debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
>>>>   debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
>>>>   debug3: Wrote 144 bytes for a total of 981
>>>>   debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>>>   debug3: check_host_in_hostfile: match line 1
>>>>   debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>>>   debug3: check_host_in_hostfile: match line 1
>>>>   debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and
>>>> matches the RSA host key.
>>>>   debug1: Found key in /home/tsakai/.ssh/known_hosts:1
>>>>   debug2: bits set: 491/1024
>>>>   debug1: ssh_rsa_verify: signature correct
>>>>   debug2: kex_derive_keys
>>>>   debug2: set_newkeys: mode 1
>>>>   debug1: SSH2_MSG_NEWKEYS sent
>>>>   debug1: expecting SSH2_MSG_NEWKEYS
>>>>   debug3: Wrote 16 bytes for a total of 997
>>>>   debug2: set_newkeys: mode 0
>>>>   debug1: SSH2_MSG_NEWKEYS received
>>>>   debug1: SSH2_MSG_SERVICE_REQUEST sent
>>>>   debug3: Wrote 48 bytes for a total of 1045
>>>>   debug2: service_accept: ssh-userauth
>>>>   debug1: SSH2_MSG_SERVICE_ACCEPT received
>>>>   debug2: key: /home/tsakai/.ssh/tsakai ((nil))
>>>>   debug3: Wrote 64 bytes for a total of 1109
>>>>   debug1: Authentications that can continue: publickey
>>>>   debug3: start over, passed a different list publickey
>>>>   debug3: preferred gssapi-with-mic,publickey
>>>>   debug3: authmethod_lookup publickey
>>>>   debug3: remaining preferred: ,publickey
>>>>   debug3: authmethod_is_enabled publickey
>>>>   debug1: Next authentication method: publickey
>>>>   debug1: Trying private key: /home/tsakai/.ssh/tsakai
>>>>   debug1: read PEM private key done: type RSA
>>>>   debug3: sign_and_send_pubkey
>>>>   debug2: we sent a publickey packet, wait for reply
>>>>   debug3: Wrote 384 bytes for a total of 1493
>>>>   debug1: Authentication succeeded (publickey).
>>>>   debug2: fd 4 setting O_NONBLOCK
>>>>   debug1: channel 0: new [client-session]
>>>>   debug3: ssh_session2_open: channel_new: 0
>>>>   debug2: channel 0: send open
>>>>   debug1: Requesting no-more-sessi...@openssh.com
>>>>   debug1: Entering interactive session.
>>>>   debug3: Wrote 128 bytes for a total of 1621
>>>>   debug2: callback start
>>>>   debug2: client_session2_setup: id 0
>>>>   debug1: Sending environment.
>>>>   debug3: Ignored env HOSTNAME
>>>>   debug3: Ignored env TERM
>>>>   debug3: Ignored env SHELL
>>>>   debug3: Ignored env HISTSIZE
>>>>   debug3: Ignored env EC2_AMITOOL_HOME
>>>>   debug3: Ignored env SSH_CLIENT
>>>>   debug3: Ignored env SSH_TTY
>>>>   debug3: Ignored env USER
>>>>   debug3: Ignored env LD_LIBRARY_PATH
>>>>   debug3: Ignored env LS_COLORS
>>>>   debug3: Ignored env EC2_HOME
>>>>   debug3: Ignored env MAIL
>>>>   debug3: Ignored env PATH
>>>>   debug3: Ignored env INPUTRC
>>>>   debug3: Ignored env PWD
>>>>   debug3: Ignored env JAVA_HOME
>>>>   debug1: Sending env LANG = en_US.UTF-8
>>>>   debug2: channel 0: request env confirm 0
>>>>   debug3: Ignored env AWS_CLOUDWATCH_HOME
>>>>   debug3: Ignored env AWS_IAM_HOME
>>>>   debug3: Ignored env SHLVL
>>>>   debug3: Ignored env HOME
>>>>   debug3: Ignored env AWS_PATH
>>>>   debug3: Ignored env AWS_AUTO_SCALING_HOME
>>>>   debug3: Ignored env LOGNAME
>>>>   debug3: Ignored env AWS_ELB_HOME
>>>>   debug3: Ignored env SSH_CONNECTION
>>>>   debug3: Ignored env LESSOPEN
>>>>   debug3: Ignored env AWS_RDS_HOME
>>>>   debug3: Ignored env G_BROKEN_FILENAMES
>>>>   debug3: Ignored env _
>>>>   debug3: Ignored env OLDPWD
>>>>   debug3: Ignored env OMPI_MCA_plm
>>>>   debug1: Sending command:  orted --daemonize -mca ess env -mca
>>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>>>   debug2: channel 0: request exec confirm 1
>>>>   debug2: fd 3 setting TCP_NODELAY
>>>>   debug2: callback done
>>>>   debug2: channel 0: open confirm rwindow 0 rmax 32768
>>>>   debug3: Wrote 272 bytes for a total of 1893
>>>>   debug2: channel 0: rcvd adjust 2097152
>>>>   debug2: channel_input_status_confirm: type 99 id 0
>>>>   debug2: exec request accepted on channel 0
>>>>   debug2: channel 0: read<=0 rfd 4 len 0
>>>>   debug2: channel 0: read failed
>>>>   debug2: channel 0: close_read
>>>>   debug2: channel 0: input open -> drain
>>>>   debug2: channel 0: ibuf empty
>>>>   debug2: channel 0: send eof
>>>>   debug2: channel 0: input drain -> closed
>>>>   debug3: Wrote 32 bytes for a total of 1925
>>>>   debug2: channel 0: rcvd eof
>>>>   debug2: channel 0: output open -> drain
>>>>   debug2: channel 0: obuf empty
>>>>   debug2: channel 0: close_write
>>>>   debug2: channel 0: output drain -> closed
>>>>   debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
>>>>   debug2: channel 0: rcvd close
>>>>   debug3: channel 0: will not send data after close
>>>>   debug2: channel 0: almost dead
>>>>   debug2: channel 0: gc: notify user
>>>>   debug2: channel 0: gc: user detached
>>>>   debug2: channel 0: send close
>>>>   debug2: channel 0: is dead
>>>>   debug2: channel 0: garbage collecting
>>>>   debug1: channel 0: free: client-session, nchannels 1
>>>>   debug3: channel 0: status: The following connections are open:
>>>>     #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
>>>>
>>>>   debug3: channel 0: close_fds r -1 w -1 e 6 c -1
>>>>   debug3: Wrote 32 bytes for a total of 1957
>>>>   debug3: Wrote 64 bytes for a total of 2021
>>>>   debug1: fd 0 clearing O_NONBLOCK
>>>>   Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
>>>>   Bytes per second: sent 18384.8, received 18944.3
>>>>   debug1: Exit status 0
>>>>   # it is hanging; I am about to issue control-C
>>>>   ^Cmpirun: killing job...
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun noticed that the job aborted, but has no info as to the process
>>>>   that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>   below. Additional manual cleanup may be required - please refer to
>>>>   the "orte-clean" tool for assistance.
>>>>
>>>> --------------------------------------------------------------------------
>>>>         domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>>>> back when launched
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # daemon did not report back when
>>>> launched
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # what does that mean?
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # I give up
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>>>   [tsakai@domU-12-31-39-16-75-1E ~]$ exit
>>>>   logout
>>>>   [tsakai@vixen ec2]$
>>>>   [tsakai@vixen ec2]$
>>>>
>>>> Do you see anything strange?
>>>>
>>>> One final question: the ssh man page mentions a few environment
>>>> variables, e.g. SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION.  Do
>>>> any of these matter as far as OpenMPI is concerned?
>>>>
>>>> Thank you, Gus.
>>>>
>>>> Regards,
>>>>
>>>> Tena
>>>>
>>>> On 2/15/11 5:09 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>>>
>>>>> Tena Sakai wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to reproduce what I was able to show last Friday on Amazon
>>>>>> EC2 instances, but I am having a problem.  What I was able to show last
>>>>>> Friday as root was with this command:
>>>>>>   mpirun -app app.ac
>>>>>> with app.ac being:
>>>>>>   -H dns-entry-A -np 1 (linux command)
>>>>>>   -H dns-entry-A -np 1 (linux command)
>>>>>>   -H dns-entry-B -np 1 (linux command)
>>>>>>   -H dns-entry-B -np 1 (linux command)
>>>>>>
>>>>>> Here's the config file in root's .ssh directory:
>>>>>>   Host *
>>>>>>         IdentityFile /root/.ssh/.derobee/.kagi
>>>>>>         IdentitiesOnly yes
>>>>>>         BatchMode yes
>>>>>>
>>>>>> Yesterday and today I can't get this to work.  I made the last part
>>>>>> of the app.ac file simpler (it now says /bin/hostname).  Below is
>>>>>> the session:
>>>>>>
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# # I am on instance A, host name for inst A is:
>>>>>>   -bash-3.2# hostname
>>>>>>   domU-12-31-39-09-CD-C2
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# nslookup domU-12-31-39-09-CD-C2
>>>>>>   Server:               172.16.0.23
>>>>>>   Address:      172.16.0.23#53
>>>>>>
>>>>>>   Non-authoritative answer:
>>>>>>   Name: domU-12-31-39-09-CD-C2.compute-1.internal
>>>>>>   Address: 10.210.210.48
>>>>>>
>>>>>>   -bash-3.2# cd .ssh
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# cat config
>>>>>>   Host *
>>>>>>           IdentityFile /root/.ssh/.derobee/.kagi
>>>>>>           IdentitiesOnly yes
>>>>>>           BatchMode yes
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# ll config
>>>>>>   -rw-r--r-- 1 root root 103 Feb 15 17:18 config
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# chmod 600 config
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# # show I can go to inst B without password/passphrase
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
>>>>>>   Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# hostname
>>>>>>   domU-12-31-39-09-E6-71
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# nslookup `hostname`
>>>>>>   Server:               172.16.0.23
>>>>>>   Address:      172.16.0.23#53
>>>>>>
>>>>>>   Non-authoritative answer:
>>>>>>   Name: domU-12-31-39-09-E6-71.compute-1.internal
>>>>>>   Address: 10.210.233.123
>>>>>>
>>>>>>   -bash-3.2# # and back to inst A is also no problem
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
>>>>>>   Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# hostname
>>>>>>   domU-12-31-39-09-CD-C2
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# # log out twice to go back to inst A
>>>>>>   -bash-3.2# exit
>>>>>>   logout
>>>>>>   Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# exit
>>>>>>   logout
>>>>>>   Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# hostname
>>>>>>   domU-12-31-39-09-CD-C2
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# cd ..
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# pwd
>>>>>>   /root
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# ll
>>>>>>   total 8
>>>>>>   -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
>>>>>>   -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# cat app.ac
>>>>>>   -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>>   -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>>   -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>>>   -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# # when there is a remote machine (bottom 2 lines) it hangs
>>>>>>   -bash-3.2# mpirun -app app.ac
>>>>>>   mpirun: killing job...
>>>>>>
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>   mpirun noticed that the job aborted, but has no info as to the process
>>>>>>   that caused that situation.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>   mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>   below. Additional manual cleanup may be required - please refer to
>>>>>>   the "orte-clean" tool for assistance.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>         domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
>>>>>> report back when launched
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# cat app.ac2
>>>>>>   -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>>   -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# # when there is no remote machine, then mpirun works:
>>>>>>   -bash-3.2# mpirun -app app.ac2
>>>>>>   domU-12-31-39-09-CD-C2
>>>>>>   domU-12-31-39-09-CD-C2
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# hostname
>>>>>>   domU-12-31-39-09-CD-C2
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# # this gotta be ssh problem....
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# # show no firewall is used
>>>>>>   -bash-3.2# iptables --list
>>>>>>   Chain INPUT (policy ACCEPT)
>>>>>>    target     prot opt source               destination
>>>>>>
>>>>>>   Chain FORWARD (policy ACCEPT)
>>>>>>   target     prot opt source               destination
>>>>>>
>>>>>>   Chain OUTPUT (policy ACCEPT)
>>>>>>   target     prot opt source               destination
>>>>>>   -bash-3.2#
>>>>>>   -bash-3.2# exit
>>>>>>   logout
>>>>>>   [tsakai@vixen ec2]$
>>>>>>
>>>>>> Would someone please point out what I am doing wrong?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tena
>>>>>>
>>>>> Hi Tena
>>>>>
>>>>> Nothing wrong that I can see.
>>>>> Just another couple of suggestions,
>>>>> based on somewhat vague possibilities.
>>>>>
>>>>> A slight difference is that on vixen and dasher you ran the
>>>>> MPI hostname tests as a regular user, not as root, right?
>>>>> Not sure if this will make much of a difference,
>>>>> but it may be worth trying to run it as a regular user in EC2 also.
>>>>> In general, most people avoid running user applications (MPI programs
>>>>> included) as root.
>>>>> Mostly for safety, but I wonder if there are any
>>>>> implications in the 'rootly powers'
>>>>> regarding the under-the-hood processes that OpenMPI
>>>>> launches along with the actual user programs.
>>>>>
>>>>> This may make no difference either,
>>>>> but you could do a 'service iptables status',
>>>>> to see if the service is running, even though there are
>>>>> no explicit iptable rules (as per your email).
>>>>> If the service is not running you get
>>>>> 'Firewall is stopped.' (in CentOS).
>>>>> I *think* 'iptables --list' loads the iptables module into the
>>>>> kernel, as a side effect, whereas the service command does not.
>>>>> So, it may be cleaner (safer?) to use the service version
>>>>> instead of 'iptables --list'.
>>>>> I don't know if it will make any difference,
>>>>> but just in case, if the service is running,
>>>>> why not do 'service iptables stop',
>>>>> and perhaps also 'chkconfig iptables off' to be completely
>>>>> free of iptables?
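>>>>>
>>>>> In shorthand, the sequence I have in mind is:
>>>>>
>>>>>    sudo service iptables status    # is the firewall service running?
>>>>>    sudo service iptables stop      # stop it for this session
>>>>>    sudo chkconfig iptables off     # keep it off across reboots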
>>>>>
>>>>> Gus Correa