Hi Tena

Answers inline.

Tena Sakai wrote:
Hi Gus,

Hence, I don't understand the lack of symmetry in the
firewall protection.
Either vixen's is too loose or dasher's is too tight, I'd venture to say.
Maybe dasher was installed later and just got whatever boilerplate firewall
comes with RedHat, CentOS, or Fedora.
If there is a gateway for this LAN somewhere with another firewall,
which is probably the case,

You are correct.  We had a system administrator, but we lost
that person, and I installed dasher from scratch myself.
I did use the boilerplate firewall from the CentOS 5.5 distribution.


I read your answer to Ashley and Reuti saying that you
turned the firewall off and OpenMPI now works between vixen and dasher.
That's good news!
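If you'd rather not leave dasher wide open, a less drastic alternative is to
keep iptables running and simply allow traffic from vixen.  A minimal sketch
for CentOS 5 (it assumes you are happy to trust everything coming from vixen's
LAN address; OpenMPI's TCP component picks ephemeral ports by default, so
opening individual ports is awkward):

  # run as root on dasher: allow all traffic from vixen, then save the rule
  [root@dasher ~]# /sbin/iptables -I INPUT -s 172.16.1.107 -j ACCEPT
  [root@dasher ~]# /sbin/service iptables save
  [root@dasher ~]# /sbin/service iptables status   # verify the rule is in place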

Do you have Internet access from either machine?

Yes, I do.

The LAN gateway is probably doing NAT.
I would guess it also has its own firewall.
Is there anybody there that could tell you about this?


Vixen has yet another private IP, 10.1.1.2 (eth0),
with a bit weird combination of broadcast address 192.168.255.255 (?)
and mask 255.0.0.0.
Maybe vixen is/was part of another group of machines via this other IP,
a cluster perhaps?

We have a Rocks HPC cluster.  The cluster head is called blitzen
and there are 8 nodes in the cluster.  We have completely outgrown
this setup.  For example, I have been running an application for the last
2 weeks on 4 of the 8 nodes, while the other 4 nodes have been used
by my colleagues, and I expect my jobs to run for another 2-3 weeks.
Which is why I am interested in the cloud.

Vixen is not part of the Rocks cluster, but it is an NFS server
as well as a database server.  Here's the ifconfig output from blitzen:

  [tsakai@blitzen Rmpi]$ ifconfig
  eth0      Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0B
            inet addr:10.1.1.1  Bcast:10.255.255.255  Mask:255.0.0.0
            inet6 addr: fe80::219:b9ff:fee0:c00b/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:58859908 errors:0 dropped:0 overruns:0 frame:0
            TX packets:38795319 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:14637456238 (13.6 GiB)  TX bytes:25487423161 (23.7 GiB)
            Interrupt:193 Memory:ec000000-ec012100
  eth1      Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0D
            inet addr:172.16.1.106  Bcast:172.16.3.255  Mask:255.255.252.0
            inet6 addr: fe80::219:b9ff:fee0:c00d/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:99465693 errors:0 dropped:0 overruns:0 frame:0
            TX packets:46026372 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:44685802310 (41.6 GiB)  TX bytes:28223858173 (26.2 GiB)
            Interrupt:193 Memory:ea000000-ea012100
  lo        Link encap:Local Loopback
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
            UP LOOPBACK RUNNING  MTU:16436  Metric:1
            RX packets:80078179 errors:0 dropped:0 overruns:0 frame:0
            TX packets:80078179 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:27450135463 (25.5 GiB)  TX bytes:27450135463 (25.5 GiB)

And here's the same for vixen:
[tsakai@vixen Rmpi]$ cat moo
  eth0      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:31
            inet addr:10.1.1.2  Bcast:192.168.255.255  Mask:255.0.0.0
            inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:61942079 errors:0 dropped:0 overruns:0 frame:0
            TX packets:61950934 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:47837093368 (44.5 GiB)  TX bytes:54525223424 (50.7 GiB)
            Interrupt:185 Memory:ea000000-ea012100
  eth1      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:33
            inet addr:172.16.1.107  Bcast:172.16.3.255  Mask:255.255.252.0
            inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:5204606192 errors:0 dropped:0 overruns:0 frame:0
            TX packets:8935890067 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:371146631795 (345.6 GiB)  TX bytes:13424275898600 (12.2 TiB)
            Interrupt:193 Memory:ec000000-ec012100
  lo        Link encap:Local Loopback
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
            UP LOOPBACK RUNNING  MTU:16436  Metric:1
            RX packets:244240818 errors:0 dropped:0 overruns:0 frame:0
            TX packets:244240818 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:1190988294201 (1.0 TiB)  TX bytes:1190988294201 (1.0 TiB)

I think you are also correct as to:

a bit weird combination of broadcast address 192.168.255.255 (?),
and mask 255.0.0.0.

I think they are both misconfigured.  I will fix them when I can.


Blitzen's configuration looks like standard Rocks to me:
eth0 for private net, eth1 for LAN or WAN.
I think it is not misconfigured.

Also, beware that Rocks has its own ways/commands to configure things
(e.g., '$ rocks do this and that').
Using the Linux tools directly sometimes breaks things or leaves loose
ends in Rocks.

Vixen's eth0 looks weird, but now that you mention your Rocks cluster,
it may be that eth0 is used to connect vixen to the
cluster's private subnet and serve NFS to it.
Still, the broadcast address doesn't look right.
I would expect it to be 10.255.255.255 (as on blitzen's eth0) if vixen
serves NFS to the cluster via eth0.
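If eth0 really is meant to sit on the cluster's 10.0.0.0/8 private network,
the place to fix it on a CentOS machine is usually
/etc/sysconfig/network-scripts/ifcfg-eth0.  A sketch of what the relevant
lines might look like (double-check against blitzen's settings first, and
remember that NFS clients are mounting over this interface):

  # /etc/sysconfig/network-scripts/ifcfg-eth0 on vixen (sketch)
  DEVICE=eth0
  BOOTPROTO=static
  IPADDR=10.1.1.2
  NETMASK=255.0.0.0
  BROADCAST=10.255.255.255
  ONBOOT=yes

  # then restart networking (this will briefly drop NFS traffic on eth0)
  [root@vixen ~]# /sbin/service network restart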

What is in your ${TORQUE}/server_priv/nodes file?
IPs or names (vixen & dasher)?

We don't use TORQUE.  We do use SGE from blitzen.


Oh, sorry, you said before you don't use Torque.
I forgot that one.

What I really meant to ask is about your OpenMPI hostfile,
or how the --app file refers to the machines,
but I guess you use host names there, not IPs.
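For reference, a minimal OpenMPI hostfile using names rather than IPs would
look like this (the slot counts here are just placeholders):

  # hostfile, used as: mpirun --hostfile hostfile -np 8 ./my_app
  vixen   slots=4
  dasher  slots=4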

Are they on a DNS server or do you resolve their names/IPs
via /etc/hosts?
Hopefully vixen's name resolves as 172.16.1.107.

They are on a DNS server:

  [tsakai@dasher Rmpi]$ nslookup vixen.egcrc.org
  Server:         172.16.1.2
  Address:        172.16.1.2#53

  Name:   vixen.egcrc.org
  Address: 172.16.1.107

  [tsakai@dasher Rmpi]$ nslookup blitzen
  Server:         172.16.1.2
  Address:        172.16.1.2#53

  Name:   blitzen.egcrc.org
  Address: 172.16.1.106

  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$


DNS makes it easier for you, especially on a LAN, where machines
change often in ways that you can't control.
You don't need to worry about resolving names with /etc/hosts,
which is the easy thing to do in a cluster.
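For completeness, the /etc/hosts alternative would just be a pair of entries
like these on each machine (addresses taken from the ifconfig and nslookup
output above; I'm assuming dasher sits in the same egcrc.org domain):

  # /etc/hosts (sketch)
  172.16.1.107   vixen.egcrc.org    vixen
  172.16.0.116   dasher.egcrc.org   dasher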

One more point that I overlooked in a previous post:

I have yet to understand whether you copy your compiled tools
(OpenMPI, R, etc) from your local machines to EC2,
or if you build/compile them directly on the EC2 environment.

Tools like OpenMPI, R, and for that matter gcc, must be part
of the AMI.  The AMI is stored on an Amazon device; it could be on
S3 (Simple Storage Service) or on an EBS volume (which is what Ashley
recommends).  So I put R and everything I needed on the AMI
before I uploaded it to Amazon.  Only I didn't put OpenMPI
on it.  I did a wget from my AMI instance to download the OpenMPI
source, compiled it on the instance, and saved that image
on S3.  So now when I launch the instance, OpenMPI is part of
the AMI.
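For the record, the build-on-the-instance step is typically just something
along these lines (only a sketch; the version number, download URL, and
install prefix are assumptions):

  $ wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.3.tar.bz2
  $ tar xjf openmpi-1.4.3.tar.bz2
  $ cd openmpi-1.4.3
  $ ./configure --prefix=/usr/local/openmpi
  $ make -j4
  $ su -c 'make install'    # the install step needs root for /usr/local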


It is clearer to me now.
It sounds right, although, other than the storage,
I can't fathom the difference between what you
did and what Ashley suggested.
Yet, somehow Ashley got it to work.
There may be something to pursue there.

Also, it's not clear to me if the OS in EC2 is an image
from your local machines' OS/Linux distro, or independent of them,
or if you can choose to have it either way.

The OS in EC2 is either Linux or Windows.  (I have never
used Windows in my life.)

I did.
Don't worry.
It is not a sin.  :)

But seriously, from the problems I read on the MPICH2 mailing list,
it seems hard to use it for HPC and parallel programming, at least.


For Linux, it can be any distribution
one chooses.  In my case, I built an AMI from a CentOS
distribution with everything I needed.  It is essentially
the same thing as dasher.

Except for the firewall, I suppose.
Did you check if it is turned off on your EC2 replica of dasher?
I don't know if this question makes any sense in the EC2 context,
but maybe it does.
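It probably does make sense for the OS-level part: the instance runs the same
iptables service as any other CentOS box, so it can be checked directly.
A sketch (the prompt hostname is a placeholder; note that EC2 also filters
traffic with its own security groups, configured on the Amazon side rather
than inside the instance):

  [root@ec2-instance ~]# /sbin/service iptables status     # running, and with what rules?
  [root@ec2-instance ~]# /sbin/chkconfig --list iptables    # will it start at boot?
  [root@ec2-instance ~]# /sbin/chkconfig iptables off       # keep it off across reboots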


In another posting, Ashley Pittman reported
using OpenMPI on Amazon EC2 without problems,
suggested a pathway, and gave several tips for it.
That is probably a more promising path,
which you may want to try.

I have a feeling that I will need more help
from her.


Unless I am mistaken, I have the feeling that the
Ashley Pittman we've been talking to is a gentleman:

http://uk.linkedin.com/in/ashleypittman

not the jewelry designer:

http://www.ashleypittman.com/company-ashley-pittman.php

Regards,

Tena




Best,
Gus


On 2/14/11 3:46 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

Tena Sakai wrote:
Hi Kevin,

Thanks for your reply.
Dasher is physically located under my desk and vixen is in a
secure data center.

 does dasher have any network interfaces that vixen does not?
No, I don't think so.
Here is more definitive info:
  [tsakai@dasher Rmpi]$ ifconfig
  eth0      Link encap:Ethernet  HWaddr 00:1A:A0:E1:84:A9
            inet addr:172.16.0.116  Bcast:172.16.3.255  Mask:255.255.252.0
            inet6 addr: fe80::21a:a0ff:fee1:84a9/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:2347 errors:0 dropped:0 overruns:0 frame:0
            TX packets:1005 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:100
            RX bytes:531809 (519.3 KiB)  TX bytes:269872 (263.5 KiB)
            Memory:c2200000-c2220000

  lo        Link encap:Local Loopback
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
            UP LOOPBACK RUNNING  MTU:16436  Metric:1
            RX packets:74 errors:0 dropped:0 overruns:0 frame:0
            TX packets:74 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:7824 (7.6 KiB)  TX bytes:7824 (7.6 KiB)

  [tsakai@dasher Rmpi]$

However, vixen has two ethernet interfaces:
  [root@vixen ec2]# /sbin/ifconfig
  eth0      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:31
            inet addr:10.1.1.2  Bcast:192.168.255.255  Mask:255.0.0.0
            inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:61913135 errors:0 dropped:0 overruns:0 frame:0
            TX packets:61923635 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:47832124690 (44.5 GiB)  TX bytes:54515478860 (50.7 GiB)
            Interrupt:185 Memory:ea000000-ea012100
  eth1      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:33
            inet addr:172.16.1.107  Bcast:172.16.3.255  Mask:255.255.252.0
            inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:5204431112 errors:0 dropped:0 overruns:0 frame:0
            TX packets:8935796075 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:371123590892 (345.6 GiB)  TX bytes:13424246629869 (12.2 TiB)
            Interrupt:193 Memory:ec000000-ec012100
  lo        Link encap:Local Loopback
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
            UP LOOPBACK RUNNING  MTU:16436  Metric:1
            RX packets:244169216 errors:0 dropped:0 overruns:0 frame:0
            TX packets:244169216 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:1190976360356 (1.0 TiB)  TX bytes:1190976360356 (1.0 TiB)
  [root@vixen ec2]#

Please see the mail posting that follows this, my reply to Ashley,
who nailed the problem precisely.

Regards,

Tena


On 2/14/11 1:35 PM, "kevin.buck...@ecs.vuw.ac.nz"
<kevin.buck...@ecs.vuw.ac.nz> wrote:

This probably shows my lack of understanding as to how OpenMPI
negotiates the connectivity between nodes when given a choice
of interfaces but anyway:

 does dasher have any network interfaces that vixen does not?

The scenario I am imagining would be that you ssh into dasher
from vixen using a "network" that both share and, similarly, when
you mpirun from vixen, the network that OpenMPI uses is constrained
by the interfaces that can be seen from vixen, so you are fine.

However, when you are on dasher, mpirun sees another interface that
it takes a liking to and so tries to use, but that interface
is not available to vixen, so the OpenMPI processes spawned there
terminate when they can't find that interface to talk back
to dasher's controlling process.

I know that you are no longer working with VMs but it's along those
lines that I was thinking: extra network interfaces that you assume
won't be used but which are and which could then be overcome by use
of an explicit

 --mca btl_tcp_if_exclude virbr0

or some such construction (virbr0 used as an example here).
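Spelled out, the full command might look something like this (only a sketch;
the interface names, hostfile, and process count are examples, and if you
override the exclude list it is usual to keep lo in it as well):

 mpirun --mca btl_tcp_if_exclude lo,virbr0 -np 4 --hostfile hostfile ./my_app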

Kevin

Hi Tena


They seem to be connected through the LAN 172.16.0.0/255.255.252.0,
with private IPs 172.16.0.116 (dasher, eth0) and
172.16.1.107 (vixen, eth1).
These addresses are probably what OpenMPI is using.
Not much like a cluster, just machines on a LAN.

Hence, I don't understand the lack of symmetry in the
firewall protection.
Either vixen's is too loose or dasher's is too tight, I'd venture to say.
Maybe dasher was installed later and just got whatever boilerplate firewall
comes with RedHat, CentOS, or Fedora.
If there is a gateway for this LAN somewhere with another firewall,
which is probably the case,
I'd guess it is OK to turn off dasher's firewall.

Do you have Internet access from either machine?

Vixen has yet another private IP 10.1.1.2 (eth0),
with a bit weird combination of broadcast address 192.168.255.255 (?),
and mask 255.0.0.0.
Maybe vixen is/was part of another group of machines, via this other IP,
a cluster perhaps?

What is in your ${TORQUE}/server_priv/nodes file?
IPs or names (vixen & dasher)?

Are they on a DNS server or do you resolve their names/IPs
via /etc/hosts?

Hopefully vixen's name resolves as 172.16.1.107.
(ping -R vixen may tell).

Gus Correa




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
