Hi Tena

Ashley already answered the main point.
See more comments below.

Ashley Pittman wrote:
> On 14 Feb 2011, at 21:10, Tena Sakai wrote:
>> Regarding firewall, they are different:
>
>> I don't understand what they mean.

Well, the iptables syntax is kind of unfriendly.
The firewall on dasher seems to be the Red Hat default.
On vixen it is turned off.

>
> vixen has a normal, or empty config and as such has no firewall,
> dasher has a number of firewall rules configured which could easily
> be the cause of the problem on these two machines.
> To be able to run OpenMPI across these two machines you'll need to
> disable the firewall on dasher.
>
> To disable the firewall the command (as root) is "service iptables stop"
> to turn it off until the next boot, or "chkconfig iptables off" to turn it off permanently
> from the next boot, obviously you should check with your network
> administrator before doing this.
>
> Ashley.
>

Ditto.

But Ashley, I'd guess that, among other hats, Tena may be wearing the network
administrator hat too, in the absence of somebody with that official role.
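
By the way, if turning the firewall off entirely feels too drastic,
a lighter-touch alternative (just a sketch; the address below is a
placeholder for vixen's actual IP) is to insert an ACCEPT rule for the
peer host at the top of dasher's existing chain:

  # as root on dasher
  /sbin/iptables -I RH-Firewall-1-INPUT -s 192.168.1.20 -j ACCEPT
  /sbin/service iptables save   # rewrites /etc/sysconfig/iptables so it survives a reboot

This keeps the rest of the Red Hat rules in place while letting all
traffic from vixen through.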

**

Tena:
Are these machines' IP addresses private
or public (Internet)?
Do they have multiple Ethernet ports?
(The situation may be different on each.)

ifconfig should give more info about the IP addresses.
lspci will tell you which Ethernet interfaces exist.
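For instance (harmless, read-only commands; run them on each machine):

  /sbin/ifconfig -a                  # every interface and the IP assigned to it
  /sbin/lspci | grep -i ethernet     # the physical Ethernet controllers present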

A public IP with no firewall is kind of risky.
(I wonder if that is what vixen has.)
A private IP on a subnet (without a firewall) is what is typically used
with MPI, say, on a cluster, where the head node (or a few special nodes)
may have additional interfaces with public addresses
protected by a firewall.

Is this reminiscent of dasher and vixen?
Were they at one time part of a cluster where dasher was the head node?
Are they connected through a switch that implements a private subnet?
Are they connected to the Internet directly/independently from each other?
Something else perhaps?

If the machines have more than one Ethernet port,
you can assign private IPs to the idle ports, connect the machines
(even directly with an Ethernet cable, if there are only two machines
and you don't have a switch), and use this little private subnet for MPI.
This would obviate the need to turn off the firewall on the public IP,
but you would have to set up your 'nodes' file and your /etc/hosts
with computer names and IPs consistent with that private subnet.
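Here is a rough sketch of what I mean, assuming the spare interface is
eth1 on both machines and picking 192.168.10.0/24 out of thin air
(adjust interface names and addresses to your situation):

  # as root on dasher
  /sbin/ifconfig eth1 192.168.10.1 netmask 255.255.255.0 up
  # as root on vixen
  /sbin/ifconfig eth1 192.168.10.2 netmask 255.255.255.0 up

  # add to /etc/hosts on both machines
  192.168.10.1   dasher-mpi
  192.168.10.2   vixen-mpi

  # then use the private names in the app file / 'nodes' file, e.g.
  -H dasher-mpi  -np 1 hostname
  -H vixen-mpi   -np 1 hostname

(To make the addresses permanent you would put them in
/etc/sysconfig/network-scripts/ifcfg-eth1, but the above is enough
for a quick test.)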


I hope this helps,
Gus Correa

Tena Sakai wrote:
Hi Gus,

Thank you for your response.

I have verified that
 1) /etc/hosts files on both machines vixen and dasher are identical
 2) both machines have nothing but comments in hosts.allow and hosts.deny
Regarding firewall, they are different:
On vixen this how it looks:
  [root@vixen ec2]# cat /etc/sysconfig/iptables
  cat: /etc/sysconfig/iptables: No such file or directory
  [root@vixen ec2]#
  [root@vixen ec2]# /sbin/iptables --list
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination
  [root@vixen ec2]#

On dasher:
  [tsakai@dasher Rmpi]$ sudo cat /etc/sysconfig/iptables
  # Firewall configuration written by system-config-securitylevel
  # Manual customization of this file is not recommended.
  *filter
  :INPUT ACCEPT [0:0]
  :FORWARD ACCEPT [0:0]
  :OUTPUT ACCEPT [0:0]
  :RH-Firewall-1-INPUT - [0:0]
  -A INPUT -j RH-Firewall-1-INPUT
  -A FORWARD -j RH-Firewall-1-INPUT
  -A RH-Firewall-1-INPUT -i lo -j ACCEPT
  -A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
  -A RH-Firewall-1-INPUT -p 50 -j ACCEPT
  -A RH-Firewall-1-INPUT -p 51 -j ACCEPT
  -A RH-Firewall-1-INPUT -p udp --dport 5353 -d 224.0.0.251 -j ACCEPT
  -A RH-Firewall-1-INPUT -p udp -m udp --dport 631 -j ACCEPT
  -A RH-Firewall-1-INPUT -p tcp -m tcp --dport 631 -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
  -A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
  COMMIT
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ sudo /sbin/iptables --list
  [sudo] password for tsakai:
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination
  RH-Firewall-1-INPUT  all  --  anywhere             anywhere

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination
  RH-Firewall-1-INPUT  all  --  anywhere             anywhere

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

  Chain RH-Firewall-1-INPUT (2 references)
  target     prot opt source               destination
  ACCEPT     all  --  anywhere             anywhere
  ACCEPT     icmp --  anywhere             anywhere            icmp any
  ACCEPT     esp  --  anywhere             anywhere
  ACCEPT     ah   --  anywhere             anywhere
  ACCEPT     udp  --  anywhere             224.0.0.251         udp dpt:mdns
  ACCEPT     udp  --  anywhere             anywhere            udp dpt:ipp
  ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ipp
  ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
  ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh
  ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:http
  REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
  [tsakai@dasher Rmpi]$

I don't understand what they mean.  Can you see any clue as to
why vixen can and dasher cannot run mpirun with the app file:
  -H dasher.egcrc.org  -np 1 hostname
  -H dasher.egcrc.org  -np 1 hostname
  -H vixen.egcrc.org   -np 1 hostname
  -H vixen.egcrc.org   -np 1 hostname

Many thanks.

Tena

On 2/14/11 11:15 AM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

Tena Sakai wrote:
Hi Reuti,

a) can you ssh from dasher to vixen?
Yes, no problem.
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ hostname
  dasher.egcrc.org
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ ssh vixen
  Last login: Mon Feb 14 10:39:20 2011 from dasher.egcrc.org
  [tsakai@vixen ~]$
  [tsakai@vixen ~]$ hostname
  vixen.egcrc.org
  [tsakai@vixen ~]$

b) firewall on vixen?
There is no firewall on vixen that I know of, but I don't
know how I can definitively show it one way or the other.
Can you please suggest how I can do this?

Regards,

Tena


Hi Tena

Besides Reuti's suggestions:

Check the consistency of /etc/hosts on both machines.
Check if there are restrictions on /etc/hosts.allow and
/etc/hosts.deny on both machines.
Check if both the MPI directories and your home/work directory
are mounted/available on both machines.
(We may have been through this checklist before, sorry if I forgot.)

Firewall info (not very friendly syntax ...):

iptables --list

or maybe better:

cat /etc/sysconfig/iptables
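
To check quickly whether the firewall service is active at all, the
standard Red Hat commands (as root) are:

service iptables status
chkconfig --list iptables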

I hope it helps,
Gus Correa

On 2/14/11 4:38 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:

Hi,

On 14.02.2011 at 04:54, Tena Sakai wrote:

I have digressed and started downward descent...

I was trying to make a simple and clear case.  Everything
I write in this very mail is about local machines.  There
are no virtual machines involved.  I am talking about two
machines, vixen and dasher, which share the same file
structure.  Vixen is an NFS server and dasher is an NFS
client.  I have just installed Open MPI 1.4.3 on dasher,
which is the same version I have on vixen.

I have a file app.ac3, which looks like:
 [tsakai@vixen Rmpi]$ cat app.ac3
 -H dasher.egcrc.org  -np 1 hostname
 -H dasher.egcrc.org  -np 1 hostname
 -H vixen.egcrc.org   -np 1 hostname
 -H vixen.egcrc.org   -np 1 hostname
 [tsakai@vixen Rmpi]$

Vixen can run this without any problem:
 [tsakai@vixen Rmpi]$ mpirun -app app.ac3
 vixen.egcrc.org
 vixen.egcrc.org
 dasher.egcrc.org
 dasher.egcrc.org
 [tsakai@vixen Rmpi]$

But I can't run this very command from dasher:
 [tsakai@vixen Rmpi]$
 [tsakai@vixen Rmpi]$ ssh dasher
 Last login: Sun Feb 13 19:26:57 2011 from vixen.egcrc.org
 [tsakai@dasher ~]$
 [tsakai@dasher ~]$ cd Notes/R/parallel/Rmpi/
 [tsakai@dasher Rmpi]$
 [tsakai@dasher Rmpi]$ mpirun -app app.ac3
 mpirun: killing job...
a) can you ssh from dasher to vixen?

b) firewall on vixen?

-- Reuti


 --------------------------------------------------------------------------
 mpirun noticed that the job aborted, but has no info as to the process
 that caused that situation.
 --------------------------------------------------------------------------
 --------------------------------------------------------------------------
 mpirun was unable to cleanly terminate the daemons on the nodes shown
 below. Additional manual cleanup may be required - please refer to
 the "orte-clean" tool for assistance.
 --------------------------------------------------------------------------
       vixen.egcrc.org - daemon did not report back when launched
 [tsakai@dasher Rmpi]$

After I issue the mpirun command, it hangs and I have to Control-C out
of it, at which point it generates the lines from "mpirun: killing job..."
on down.

A strange thing is that dasher has no problem executing the same
thing via ssh:
 [tsakai@dasher Rmpi]$ ssh vixen.egcrc.org hostname
 vixen.egcrc.org
 [tsakai@dasher Rmpi]$

In fact, dasher can run it via mpirun so long as no foreign machine
is present in the app file.  I.e.,
 [tsakai@dasher Rmpi]$ cat app.ac4
 -H dasher.egcrc.org  -np 1 hostname
 -H dasher.egcrc.org  -np 1 hostname
 # -H vixen.egcrc.org   -np 1 hostname
 # -H vixen.egcrc.org   -np 1 hostname
 [tsakai@dasher Rmpi]$
 [tsakai@dasher Rmpi]$ mpirun -app app.ac4
 dasher.egcrc.org
 dasher.egcrc.org
 [tsakai@dasher Rmpi]$

Can you please tell me why I can go one way (from vixen to dasher)
and not the other way (dasher to vixen)?

Thank you.

Tena


On 2/12/11 9:42 PM, "Gustavo Correa" <g...@ldeo.columbia.edu> wrote:

Hi Tena

Thank you for taking the time to explain the details of
the EC2 procedure.

I am afraid I have used up everything in my bag of tricks.
As Ralph and Jeff suggested, this seems to be a very specific
problem with EC2.

The difference in behavior when you run as root vs. when you run as tsakai
suggests that EC2 imposes some restriction on regular users
that isn't present on ordinary machines (Linux or otherwise), I guess.
This may be yet another 'stone to turn', as you like to say.
It also suggests that there is nothing wrong in principle with your
Open MPI setup or with your program; otherwise root would not be able to
run it.

Besides Ralph's suggestion of trying the EC2 mailing list archive,
I wonder if EC2 has any kind of user support where you could ask
for help.
After all, it is a paid service, isn't it?
(Open MPI is not paid and has great customer service, doesn't it?  :) )
You have a well-documented case to present,
and the very peculiar fact that the program fails for normal users but
runs for root.
This should help the EC2 support staff start looking for a solution.

I am running out of suggestions of what you could try on your own.
But let me try:

1) You may try to reduce the problem to its lowest common denominator,
perhaps by running non-R MPI programs on EC2, maybe the hello_c.c,
ring_c.c, and connectivity_c.c programs in the Open MPI examples directory.
This would avoid the extra layer of complexity introduced by R.
Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 hostname).
I.e., go in a progression of increasing complexity and see where you hit
the wall.
This may shed some light on what is going on.
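Something along these lines, say (assuming the examples/ directory of
the Open MPI source tree is at hand; adjust paths to wherever yours lives):

mpiexec -np 2 hostname                    # no MPI communication at all
mpicc examples/hello_c.c -o hello_c
mpiexec -np 2 ./hello_c                   # just MPI_Init / MPI_Finalize
mpicc examples/ring_c.c -o ring_c
mpiexec -np 2 ./ring_c                    # actual message passing between ranks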

I don't know whether this suggestion will really help, though.
It is not clear to me where the thing fails: during program execution,
or while mpiexec is setting up the environment for the program to run.
If it is very early in the process, before the program starts, my
suggestion won't work.
Jeff and Ralph, who know Open MPI inside out, may have better advice in
this regard.

2) Another thing would be to try to run R on EC2 in serial mode, without
mpiexec, interactively or via a script, to see which one EC2 doesn't like:
R or Open MPI (or maybe both).

Gus Correa

On Feb 11, 2011, at 9:54 PM, Tena Sakai wrote:

Hi Gus,

Thank you for your tips.

I didn't find any smoking gun, or anything that comes close.
Here's the upshot:

[tsakai@ip-10-114-239-188 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 61504
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 61504
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ sudo su
bash-3.2#
bash-3.2# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 61504
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
bash-3.2#
bash-3.2#
bash-3.2# ulimit -a > root_ulimit-a
bash-3.2# exit
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
14c14
< max user processes              (-u) unlimited
---
> max user processes              (-u) 61504
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
480     0       762674
762674
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ sudo su
bash-3.2#
bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
512     0       762674
762674
bash-3.2# exit
exit
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
-bash: sysctl: command not found
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ /sbin/!!
/sbin/sysctl -a |grep fs.file-max
error: permission denied on key 'kernel.cad_pid'
error: permission denied on key 'kernel.cap-bound'
fs.file-max = 762674
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
fs.file-max = 762674
[tsakai@ip-10-114-239-188 ~]$

I see a bit of difference between root and tsakai, but I cannot
believe such a small difference results in the somewhat catastrophic
failure I have reported.  Would you agree with me?

Regards,

Tena

On 2/11/11 6:06 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

Hi Tena

Please read one answer inline.

Tena Sakai wrote:
Hi Jeff,
Hi Gus,

Thanks for your replies.

I have pretty much ruled out PATH issues by setting tsakai's PATH
to be identical to that of root.  In that setting I reproduced the
same result as before: root can run mpirun correctly and tsakai
cannot.

I have also checked out permission on /tmp directory.  tsakai has
no problem creating files under /tmp.

I am trying to come up with a strategy to show that each and every
program in the PATH has "world" executable permission.  It is a
stone to turn over, but I am not holding my breath.

... you are running out of file descriptors. Are file descriptors
limited on a per-process basis, perchance?
I have never heard of such a restriction on Amazon EC2.  There
are folks who keep instances running for a long, long time.  Whereas
in my case, I launch 2 instances, check things out, and then turn
the instances off.  (Given that the state of California has huge
debts, our funding is very tight.)  So, I really doubt that's the
case.  I have run mpirun unsuccessfully as user tsakai and immediately
after successfully as root.  Still, I would be happy if you could tell
me a way to see how many file descriptors are in use or remaining.

Your mention of file descriptors made me think of something under
/dev.  But I don't know exactly what I am fishing for.  Do you have
any suggestions?

1) If the environment has anything to do with Linux,
check:

cat /proc/sys/fs/file-nr /proc/sys/fs/file-max


or

sysctl -a |grep fs.file-max

This max can be set (fs.file-max=whatever_is_reasonable)
in /etc/sysctl.conf

See 'man sysctl' and 'man sysctl.conf'
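
For example (the value shown is only illustrative):

# in /etc/sysctl.conf
fs.file-max = 762674

# apply without rebooting:
sysctl -p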

2) Another possible source of limits.

Check "ulimit -a" (bash) or "limit" (tcsh).

If you need to change them, look at:

/etc/security/limits.conf

(See also 'man limits.conf')
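
For example, something like this in /etc/security/limits.conf
(user name and numbers are only illustrative):

tsakai   soft   nofile   4096
tsakai   hard   nofile   8192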

**

Since "root can but Tena cannot",
I would check 2) first,
as those are the per-user/per-group limits,
whereas 1) is a kernel/system-wide setting.

I hope this helps,
Gus Correa

PS - I know you are a wise and careful programmer,
but here we had cases of programs that would
fail because of too many files that were open and never closed,
eventually exceeding the max available/permissible.
So, it does happen.

I wish I could reproduce this (weird) behavior on a different
set of machines.  I certainly cannot in my local environment.  Sigh!

Regards,

Tena


On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
wrote:

It is concerning if the pipe system call fails - I can't think of why
that would happen. That's not usually a permissions issue but rather a
deeper indication that something is either seriously wrong on your system
or you are running out of file descriptors. Are file descriptors limited
on a per-process basis, perchance?

Sent from my PDA. No type good.

On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu>
wrote:

Hi Tena

Since root can but you can't,
is it a directory permission problem, perhaps?
Check the execution directory permissions (on both machines,
if this is not an NFS-mounted dir).
I am not sure, but IIRR Open MPI also uses /tmp for
under-the-hood stuff, so it is worth checking permissions there too.
Just a naive guess.
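For instance, a quick sanity check on both machines might be:

ls -ld /tmp                          # should show drwxrwxrwt (world-writable, sticky bit)
ls -ld ~ ~/your/work/dir             # substitute your actual execution directory
touch /tmp/foo$$ && rm /tmp/foo$$    # can you actually write to /tmp?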

Congrats for all the progress with the cloudy MPI!

Gus Correa

Tena Sakai wrote:
Hi,
I have made a bit more progress.  I think I can say the ssh
authentication problem is behind me now.  I am still having a problem
running mpirun, but the latest discovery, which I can reproduce, is that
I can run mpirun as root.  Here's the session log:
[tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ ll
total 8
-rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
-rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ ll .ssh
total 16
-rw------- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
-rw------- 1 tsakai tsakai  102 Feb 11 00:34 config
-rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
-rw------- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
[tsakai@ip-10-100-243-195 ~]$
[tsakai@ip-10-100-243-195 ~]$ # I am on machine B
[tsakai@ip-10-100-243-195 ~]$ hostname
ip-10-100-243-195
[tsakai@ip-10-100-243-195 ~]$
[tsakai@ip-10-100-243-195 ~]$ ll
total 8
-rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
-rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
[tsakai@ip-10-100-243-195 ~]$
[tsakai@ip-10-100-243-195 ~]$
[tsakai@ip-10-100-243-195 ~]$ cat app.ac
-H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
-H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
-H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
-H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
[tsakai@ip-10-100-243-195 ~]$
[tsakai@ip-10-100-243-195 ~]$ # go back to machine A
[tsakai@ip-10-100-243-195 ~]$
[tsakai@ip-10-100-243-195 ~]$ exit
logout
Connection to ip-10-100-243-195.ec2.internal closed.
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ hostname
ip-10-195-198-31
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac

--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it
encountered
an
error:
Error: pipe function call failed when setting up I/O forwarding
subsystem
Node: ip-10-195-198-31
while attempting to start process rank 0.

--------------------------------------------------------------------------
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ # try it as root
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ sudo su
bash-3.2#
bash-3.2# pwd
/home/tsakai
bash-3.2#
bash-3.2# ls -l /root/.ssh/config
-rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
bash-3.2#
bash-3.2# cat /root/.ssh/config
Host *
       IdentityFile /root/.ssh/.derobee/.kagi
       IdentitiesOnly yes
       BatchMode yes
bash-3.2#
bash-3.2# pwd
/home/tsakai
bash-3.2#
bash-3.2# ls -l
total 8
-rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
-rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
bash-3.2#
bash-3.2# # now is the time for mpirun
bash-3.2#
bash-3.2# mpirun --app ./app.ac
13 ip-10-100-243-195
21 ip-10-100-243-195
5 ip-10-195-198-31
8 ip-10-195-198-31
bash-3.2#
bash-3.2# # It works (being root)!
bash-3.2#
bash-3.2# exit
exit
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac

--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it
encountered
an
error:
Error: pipe function call failed when setting up I/O forwarding
subsystem
Node: ip-10-195-198-31
while attempting to start process rank 0.

--------------------------------------------------------------------------
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ # I don't get it.
[tsakai@ip-10-195-198-31 ~]$
[tsakai@ip-10-195-198-31 ~]$ exit
logout
[tsakai@vixen ec2]$
So, why does it say "pipe function call failed when setting up
I/O forwarding subsystem Node: ip-10-195-198-31"?
The node it is referring to is not the remote machine.  It is
what I call machine A.  I first thought maybe this is a problem
with the PATH variable.  But I don't think so.  I compared root's
PATH to that of tsakai's, made them identical, and retried.
I got the same behavior.
If you could enlighten me as to why this is happening, I would really
appreciate it.
Thank you.
Tena
On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:
Hi Jeff,

Thanks for the firewall tip.  I tried it while allowing all TCP
traffic and got an interesting and perplexing result.  Here's what's
interesting (BTW, I got rid of "LogLevel DEBUG3" from ~/.ssh/config
on this run):

[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
Host key verification failed.


--------------------------------------------------------------------------
A daemon (pid 2743) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.


--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.


--------------------------------------------------------------------------
mpirun: clean termination accomplished

[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to
/usr/local/lib
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ # I better do this on machine B as well
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
Warning: Identity file tsakai not accessible: No such file or directory.
Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
[tsakai@ip-10-195-171-159 ~]$
[tsakai@ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
[tsakai@ip-10-195-171-159 ~]$
[tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB
LD_LIBRARY_PATH=/usr/local/lib
[tsakai@ip-10-195-171-159 ~]$
[tsakai@ip-10-195-171-159 ~]$ # OK, now go bak to machine A
[tsakai@ip-10-195-171-159 ~]$ exit
logout
Connection to ip-10-195-171-159 closed.
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ hostname
ip-10-203-21-132
[tsakai@ip-10-203-21-132 ~]$ # try mpirun again
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
Host key verification failed.


--------------------------------------------------------------------------
A daemon (pid 2789) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.


--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.


--------------------------------------------------------------------------
mpirun: clean termination accomplished

[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in
/usr/local/lib...
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
total 16604
lrwxrwxrwx 1 root root      16 Feb  8 23:06 libfuse.so -> libfuse.so.2.8.5
lrwxrwxrwx 1 root root      16 Feb  8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
lrwxrwxrwx 1 root root      25 Feb  8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
lrwxrwxrwx 1 root root      25 Feb  8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
lrwxrwxrwx 1 root root      15 Feb  8 23:06 libmpi.so -> libmpi.so.0.0.2
lrwxrwxrwx 1 root root      15 Feb  8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-rte.so.0 -> libopen-rte.so.0.0.0
lrwxrwxrwx 1 root root      26 Feb  8 23:06 libopenmpi_malloc.so -> libopenmpi_malloc.so.0.0.0
lrwxrwxrwx 1 root root      26 Feb  8 23:06 libopenmpi_malloc.so.0 -> libopenmpi_malloc.so.0.0.0
lrwxrwxrwx 1 root root      20 Feb  8 23:06 libulockmgr.so -> libulockmgr.so.1.0.1
lrwxrwxrwx 1 root root      20 Feb  8 23:06 libulockmgr.so.1 -> libulockmgr.so.1.0.1
lrwxrwxrwx 1 root root      16 Feb  8 23:06 libxml2.so -> libxml2.so.2.7.2
lrwxrwxrwx 1 root root      16 Feb  8 23:06 libxml2.so.2 -> libxml2.so.2.7.2
-rw-r--r-- 1 root root  385912 Jan 26 01:00 libvt.a
[tsakai@ip-10-203-21-132 ~]$
[tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused...
[tsakai@ip-10-203-21-132 ~]$

Do you know why it's complaining about shared libraries?

Thank you.

Tena


On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

Your prior mails were about ssh issues, but this one sounds like you
might have firewall issues.

That is, the "orted" command attempts to open a TCP socket back to
mpirun for various command and control reasons.  If it is blocked from
doing so by a firewall, Open MPI won't run.  In general, you can either
disable your firewall or you can set up a trust relationship for TCP
connections within your cluster.
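
As an aside, if the firewall cannot simply be opened, one possibility
(a sketch only -- check the exact MCA parameter names for your Open MPI
version with "ompi_info --param all all") is to pin Open MPI's TCP
traffic to a fixed port range and then allow just that range through
the firewall:

mpirun --mca oob_tcp_port_min_v4 10000 --mca oob_tcp_port_range_v4 100 \
       --mca btl_tcp_port_min_v4 10100 --mca btl_tcp_port_range_v4 100 \
       -app app.ac2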



On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:

Hi Reuti,

Thanks for suggesting "LogLevel DEBUG3."  I did so, and the complete
session is captured in the attached file.

What I did is much the same as what I have done before: verify
that ssh works and then run the mpirun command.  In my somewhat lengthy
session log, there are two responses from "LogLevel DEBUG3": first
from an scp invocation and then from the mpirun invocation.  They both
say
debug1: Authentication succeeded (publickey).

From the mpirun invocation, I see the line:
debug1: Sending command:  orted --daemonize -mca ess env -mca
orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
The IP address at the end of the line is indeed that of machine B.
After that it hung and I control-C'd out of it, which
gave me more lines.  But the lines after
debug1: Sending command:  orted bla bla bla
don't look good to me.  But, in truth, I have no idea what they
mean.

If you could shed some light, I would appreciate it very much.

Regards,

Tena


On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:

Hi,

On 10.02.2011 at 19:11, Tena Sakai wrote:

your local machine is Linux like, but the execution hosts
are Macs? I saw the /Users/tsakai/... in your output.
No, my environment is entirely Linux.  The path to my home
directory on one host (blitzen) has been known as /Users/tsakai,
even though it is an NFS mount from vixen (which is known to
itself as /home/tsakai).  For historical reasons, I have
chosen to make a symbolic link named /Users to vixen's /home,
so that I can use a consistent path on both vixen and blitzen.
Okay. Sometimes the protection of the home directory must be
adjusted too, but as you can do it from the command line this
shouldn't be an issue.


Is this a private cluster (or at least private interfaces)?
It would also be an option to use hostbased authentication,
which will avoid setting any known_hosts file or passphraseless
ssh-keys for each user.
No, it is not a private cluster.  It is Amazon EC2.  When I
ssh from my local machine (vixen) I use its public interface,
but to address one Amazon cluster node from the other I use
the nodes' private DNS names: domU-12-31-39-07-35-21 and
domU-12-31-39-06-74-E2.  Both public and private DNS names
change from one launch to another.  I am using passphraseless
ssh keys for authentication in all cases, i.e., from vixen to
Amazon node A, from Amazon node A to Amazon node B, and from
Amazon node B back to A.  (Please see my initial post.  There
is a session dialogue for this.)  They all work without an
authentication dialogue, except a brief initial exchange:
The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
can't be established.
RSA key fingerprint is
e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
Are you sure you want to continue connecting (yes/no)?
to which I say "yes."
But I am unclear on what you mean by "hostbased authentication".
Doesn't that mean with a password?  If so, it is not an option.
No. It's convenient inside a private cluster as it won't fill each
user's known_hosts file and you don't need to create any ssh keys.
But when the hostname changes every time, it might also create new
host keys. It uses host keys (private and public); this way it works
for all users. Just for reference:

http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html

You could look into it later.

==

- Can you try to run a command when connecting from A to B? E.g.
"ssh domU-12-31-39-06-74-E2 ls". Is this working too?

- What about putting:

LogLevel DEBUG3

in your ~/.ssh/config? Maybe we can see what it's trying to
negotiate before it fails in verbose mode.


-- Reuti



Regards,

Tena


On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:

Hi,

your local machine is Linux-like, but the execution hosts are Macs?
I saw the /Users/tsakai/... in your output.

a) executing a command on them is also working, e.g.: ssh
domU-12-31-39-07-35-21 ls

On 10.02.2011 at 07:08, Tena Sakai wrote:

Hi,

I have made a bit of progress(?)...
I made a config file in my .ssh directory on the cloud.  It looks
like:
# machine A
Host domU-12-31-39-07-35-21.compute-1.internal
This is just an abbreviation or nickname above. To use the specified
settings, it's necessary to specify exactly this name. When the settings
are the same anyway for all machines, you can use:

Host *
IdentityFile /home/tsakai/.ssh/tsakai
IdentitiesOnly yes
BatchMode yes

instead.

Is this a private cluster (or at least private interfaces)? It would
also be an option to use hostbased authentication, which will avoid
setting up any known_hosts file or passphraseless ssh-keys for each user.

-- Reuti


HostName domU-12-31-39-07-35-21
BatchMode yes
IdentityFile /home/tsakai/.ssh/tsakai
ChallengeResponseAuthentication no
IdentitiesOnly yes

# machine B
Host domU-12-31-39-06-74-E2.compute-1.internal
HostName domU-12-31-39-06-74-E2
BatchMode yes
IdentityFile /home/tsakai/.ssh/tsakai
ChallengeResponseAuthentication no
IdentitiesOnly yes

This file exists on both machine A and machine B.

Now when I issue the mpirun command as below:
[tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2

It hangs.  I control-C out of it and I get:
mpirun: killing job...



--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.


--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.


--------------------------------------------------------------------------
    domU-12-31-39-07-35-21.compute-1.internal - daemon did not report back when launched

Am I making progress?

Does this mean I am past authentication and something else is the
problem?
Does someone have an example .ssh/config file I can look at?  There
are so many keyword-argument pairs for this config file, and I would
like to look at a very basic one that works.

Thank you.

Tena Sakai
tsa...@gallo.ucsf.edu

On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:

Hi

I have an app.ac1 file like below:
[tsakai@vixen local]$ cat app.ac1
-H vixen.egcrc.org   -np 1 Rscript
/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
-H vixen.egcrc.org   -np 1 Rscript
/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
-H blitzen.egcrc.org -np 1 Rscript
/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
-H blitzen.egcrc.org -np 1 Rscript
/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8

The program I run is
Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
where x is [5..8].  The machines vixen and blitzen each run it twice.

Here's the program fib.R:
[tsakai@vixen local]$ cat fib.R
   # fib() computes, given index n, fibonacci number iteratively
   # here's the first dozen sequence (indexed from 0..11)
   # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89

fib <- function( n ) {
       a <- 0
       b <- 1
       for ( i in 1:n ) {
            t <- b
            b <- a
            a <- a + t
       }
       a
}

arg <- commandArgs( TRUE )
myHost <- system( 'hostname', intern=TRUE )
cat( fib(arg), myHost, '\n' )

It reads an argument from the command line and produces the Fibonacci
number that corresponds to that index, followed by the machine name.
Pretty simple stuff.

Here's the run output:
[tsakai@vixen local]$ mpirun -app app.ac1
5 vixen.egcrc.org
8 vixen.egcrc.org
13 blitzen.egcrc.org
21 blitzen.egcrc.org

Which is exactly what I expect.  So far so good.

Now I want to run the same thing in the cloud.  I launch 2 instances of
the same virtual machine, which I get to by:
[tsakai@vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns

Now I am on machine A:
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without password authentication,
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
domU-12-31-39-00-D1-F2
[tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai domU-12-31-39-0C-C8-01
Last login: Wed Feb  9 20:51:48 2011 from 10.254.214.4
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
[tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname
domU-12-31-39-0C-C8-01
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A without using password
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai domU-12-31-39-00-D1-F2
The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' can't
be established.
RSA key fingerprint is
e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the list
of known hosts.
Last login: Wed Feb  9 20:49:34 2011 from 10.215.203.239
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
domU-12-31-39-00-D1-F2
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ exit
logout
Connection to domU-12-31-39-00-D1-F2 closed.
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ exit
logout
Connection to domU-12-31-39-0C-C8-01 closed.
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A
[tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
domU-12-31-39-00-D1-F2

As you can see, neither machine uses a password for authentication;
they use public/private key pairs.  There is no problem (that I can see)
with ssh invocation from one machine to the other.  This is so because
I have a copy of the public key and a copy of the private key on each
instance.

The app.ac file is identical, except the node names:
[tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
-H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
-H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
-H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
-H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8

Here's what happens with mpirun:

[tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
tsakai@domu-12-31-39-0c-c8-01's password:
Permission denied, please try again.
tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job...



--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.


--------------------------------------------------------------------------

mpirun: clean termination accomplished

[tsakai@domU-12-31-39-00-D1-F2 ~]$

Mpirun (or somebody else?) asks me for a password, which I don't have.
I end up typing control-C.

Here's my question:
How can I get past authentication by mpirun when there is no password?

I would appreciate your help/insight greatly.

Thank you.

Tena Sakai
tsa...@gallo.ucsf.edu
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

