Re: hadoop on EC2

Steve Loughran Wed, 04 Jun 2008 04:12:03 -0700

Andreas Kostyrka wrote:

Well, the basic "trouble" with EC2 is that clusters usually are not networks in the TCP/IP sense.
This makes it painful to decide which URLs should be resolved where.
Plus, to make it even more painful, you cannot easily run it with one simple SOCKS server, because you need to defer DNS resolution to inside the cluster: VM names resolve to external IPs, while the webservers we are all interested in reside on the internal 10/8 IPs.
Another fun item is that in many situations you will have multiple islands inside EC2 (a contractor working for multiple customers that have EC2 deployments comes to mind), so you cannot just route everything over one pipe into EC2.
My current setup relies on a very long list of -L ssh tunnel forwards plus iptables rules in the nat OUTPUT chain that make external-ip-of-vm1:50030 get redirected to localhost:SOMEPORT, which is forwarded to name-of-vm1:50030 via ssh. (Implementation left as an exercise for the reader, or my ugly non-error-checking script available on request :-P)
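A minimal sketch of how such a setup could be generated, not the script mentioned above. It only builds the command lines and runs nothing. The gateway name, the IP address, and the local port are made-up placeholders, and the use of the REDIRECT target is an assumption about how the nat OUTPUT rules would be written.

```python
from typing import List, NamedTuple


class Forward(NamedTuple):
    external_ip: str    # public IP the VM name resolves to from outside EC2
    internal_name: str  # name that resolves correctly from inside the cluster
    remote_port: int    # port of the web UI, e.g. 50030
    local_port: int     # local end of the ssh tunnel (SOMEPORT)


def ssh_command(gateway: str, forwards: List[Forward]) -> List[str]:
    """Build one ssh invocation carrying a -L forward per VM."""
    cmd = ["ssh", "-N"]  # -N: no remote command, forwarding only
    for f in forwards:
        cmd += ["-L", f"{f.local_port}:{f.internal_name}:{f.remote_port}"]
    cmd.append(gateway)
    return cmd


def iptables_rules(forwards: List[Forward]) -> List[List[str]]:
    """Build nat OUTPUT rules sending external-ip:port to the local tunnel end."""
    return [
        ["iptables", "-t", "nat", "-A", "OUTPUT", "-p", "tcp",
         "-d", f.external_ip, "--dport", str(f.remote_port),
         "-j", "REDIRECT", "--to-ports", str(f.local_port)]
        for f in forwards
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    fwds = [Forward("203.0.113.10", "name-of-vm1", 50030, 15030)]
    print(" ".join(ssh_command("user@gateway-vm", fwds)))
    for rule in iptables_rules(fwds):
        print(" ".join(rule))
```

The printed commands would have to be run by hand or through a wrapper (the iptables ones as root), and each island inside EC2 would need its own gateway and its own list of forwards.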
If one wanted a more generic solution to redirect TCP ports via an ssh SOCKS tunnel (aka "dynamic port forwarding"), the following components would be needed:
-) a list of rules for what gets forwarded where and how.
-) a DNS resolver that issues fake IP addresses to capture the "name" of the connected host.
-) a small forwarding script that checks the "real destination IP" to decide which IP address/port is being requested. (Hint: current Linux kernels don't use getsockname anymore, the real destination is carried nowadays as a socket option)
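For the third component, a rough sketch of reading the real destination of a redirected connection on Linux. It assumes the socket option meant in the hint is SO_ORIGINAL_DST from netfilter (value 80 in <linux/netfilter_ipv4.h>), which Python's socket module does not export by name, and it handles IPv4 only. The function names are made up for this example.

```python
import socket
import struct

# Assumption: value taken from <linux/netfilter_ipv4.h>.
SO_ORIGINAL_DST = 80
SOCKADDR_IN_SIZE = 16  # sizeof(struct sockaddr_in)


def parse_sockaddr_in(raw: bytes):
    """Decode a struct sockaddr_in into (ip, port).

    Layout: 2 bytes family, 2 bytes port (network order),
    4 bytes IPv4 address, 8 bytes padding.
    """
    port, packed_ip = struct.unpack_from("!2xH4s", raw)
    return socket.inet_ntoa(packed_ip), port


def original_destination(conn: socket.socket):
    """Return the pre-redirect (ip, port) of an accepted TCP connection."""
    raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, SOCKADDR_IN_SIZE)
    return parse_sockaddr_in(raw)
```

A forwarder would call original_destination() on each accepted connection, look the address up in the table of fake IPs handed out by the resolver, and open the matching connection through the tunnel.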
One of the uglier parts, for which I have found no "real" solution, is the fact that one cannot be sure that ssh will be able to listen on a given port.
Solutions I've found include:
-) check the port before issuing ssh (Race condition warning: Going through this hole the whole federation star fleet could get lost.)
-) using some kind of expect to drive ssh through a pty.
-) roll your own ssh tunnel solution. The only lib that comes to my mind is Twisted, in which case one could ignore the need for the SOCKS protocol.
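The first option can be sketched as below, under the assumption that the tunnel listens on 127.0.0.1. Both helpers are illustrative names, and both keep the race described above: the port is released again before ssh gets a chance to bind it.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; True means it was free at the time of the check."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Let the kernel choose an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]
```

Since the check cannot close the race, it may also be worth starting ssh with -o ExitOnForwardFailure=yes, so that a failed bind shows up as a non-zero exit status instead of a warning on stderr.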
But luckily for us, the solution is easier, because we only need to tunnel http in the hadoop case, which has the high benefit that we do not need to capture the hostname, because http remembers the hostname inside the payload.
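To illustrate the point about the payload: an HTTP/1.1 request carries the target name in its Host header, so a forwarder can read it from the first bytes of the connection. This is a simplified sketch with a made-up function name; it assumes the whole request head has already been read, and it does not handle IPv6 literals.

```python
from typing import Optional, Tuple


def host_from_request(head: bytes) -> Optional[Tuple[str, int]]:
    """Return (hostname, port) from the Host header, or None if it is absent."""
    lines = head.split(b"\r\n")
    for line in lines[1:]:          # skip the request line
        if not line:                # blank line ends the headers
            break
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            host, _, port = value.strip().decode("ascii").partition(":")
            return host, int(port) if port else 80
    return None


if __name__ == "__main__":
    req = b"GET /jobtracker.jsp HTTP/1.1\r\nHost: name-of-vm1:50030\r\n\r\n"
    print(host_from_request(req))
```

The name recovered this way can then be resolved on the inside of the cluster, which is the step a plain TCP redirect cannot do without the fake-IP resolver.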

Do you worry/address the risk of someone like me bringing up a machine in the EC2 farm that then portscans all the near-neighbours in the address space for open hdfs data node/name node ports, and strikes up a conversation with your filesystem?




--
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/
