Hi Joydeep,

The problem you are hitting may be that port 50001 isn't open to the
outside, whereas from within the cluster any node can talk to any other
node (because the security groups are set up to allow this).
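
If you do want to reach the namenode directly from outside, you would
have to authorize the port in the cluster's security group yourself;
something along these lines with the EC2 command line tools (the group
name and source address below are placeholders, not values from your
setup):

  # (sketch) open the namenode port to just your workstation's address;
  # the group name and source CIDR are placeholders
  ec2-authorize your-hadoop-cluster-group -P tcp -p 50001 -s 203.0.113.7/32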

However, I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
"Public and Elastic IP Data Transfer"). This is going to get expensive
fast!

So to get this to work well, we would have to make changes to Hadoop so
that it was aware of both public and private addresses and used the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.
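
For what it's worth, the client-side half of that split can already be
approximated with a per-command override (just a sketch, reusing the
public hostname and port from your log; it does not address the
HADOOP-985 block-location issue raised earlier in the thread, since the
namenode still hands out the datanodes' private addresses):

  # (sketch) override fs.default.name only on the external client, using the
  # public master hostname; the cluster's own config keeps the private hostname
  hadoop fs -D fs.default.name=hdfs://ec2-75-101-199-45.compute-1.amazonaws.com:50001/ -ls /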

Cheers,
Tom

On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma <jssa...@facebook.com> wrote:
> I changed the ec2 scripts to have fs.default.name assigned to the public 
> hostname (instead of the private hostname).
>
> Now I can submit jobs remotely via the SOCKS proxy (the problem below is
> resolved), but the map tasks fail with an exception:
>
>
> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. 
> Already tried 9 time(s).
> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error 
> running child
> java.io.IOException: Call to 
> ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on 
> local exception: Connection refused
>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy1.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>        at 
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>        at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>
>
> Strangely enough, job submissions from nodes within the EC2 cluster work
> just fine. I looked at the job.xml files of jobs submitted locally and
> remotely and don't see any relevant differences.
>
> Totally foxed now.
>
> Joydeep
>
> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:jssa...@facebook.com]
> Sent: Wednesday, May 13, 2009 9:38 PM
> To: core-user@hadoop.apache.org
> Cc: Tom White
> Subject: RE: public IP for datanode on EC2
>
> Thanks Philip. Very helpful (and great blog post)! This seems to make basic 
> dfs command line operations work just fine.
>
> However, I am hitting a new error during job submission (running
> hadoop-0.19.0):
>
> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver 
> (SessionState.java:printError(279)) - Job Submission failed with exception 
> 'java.net.UnknownHostException(unknown host: 
> domU-12-31-39-00-51-94.compute-1.internal)'
> java.net.UnknownHostException: unknown host: 
> domU-12-31-39-00-51-94.compute-1.internal
>        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
>        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>        at org.apache.hadoop.ipc.Client.call(Client.java:686)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy0.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>        at 
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>        at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
>        at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
>        at 
> org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
>
>
> Looking at the stack trace and the code, it seems this is happening because
> the JobClient asks the JobTracker for the mapred system directory, and the
> JobTracker replies with a path name that is qualified against its own
> fs.default.name setting. Unfortunately, the standard EC2 scripts assign this
> to the internal hostname of the Hadoop master.
>
> Is there any downside to using public hostnames instead of the private ones
> in the EC2 starter scripts?
>
> Thanks for the help,
>
> Joydeep
>
>
> -----Original Message-----
> From: Philip Zeyliger [mailto:phi...@cloudera.com]
> Sent: Wednesday, May 13, 2009 2:40 PM
> To: core-user@hadoop.apache.org
> Subject: Re: public IP for datanode on EC2
>
> On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <jssa...@facebook.com> 
> wrote:
>> (raking up real old thread)
>>
>> After struggling with this issue for some time now, it seems that accessing
>> HDFS on EC2 from outside EC2 is not possible.
>>
>> This is primarily because of
>> https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are
>> authorized in EC2 and we set the public hostname via slave.host.name, the
>> namenode uses the internal IP addresses of the datanodes for block locations.
>> DFS clients outside EC2 cannot reach these addresses and fail to read/write
>> data blocks.
>>
>> HDFS/EC2 gurus: would it be reasonable to ask for an option to not use IP
>> addresses (and instead use datanode hostnames, as before HADOOP-985)?
>>
>> I really like the idea of being able to use an external node (my personal
>> workstation) to do job submission (which typically requires interacting with
>> HDFS in order to push files into the job cache, etc.). This way I don't need
>> custom AMIs; I can use stock Hadoop AMIs (all the custom software is on the
>> external node). Without the above option, this is not currently possible.
>
> You could use ssh to set up a SOCKS proxy between your machine and
> EC2, and configure org.apache.hadoop.net.SocksSocketFactory to be the
> socket factory.
> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
> has more information.
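>
> Roughly, assuming a local SOCKS port of 6666 (the port, user, and master
> hostname below are only examples):
>
>   # open a SOCKS tunnel from your workstation to the EC2 master
>   ssh -D 6666 root@ec2-75-101-199-45.compute-1.amazonaws.com
>
>   # then tell the Hadoop client to send its RPCs through the proxy
>   hadoop fs \
>     -D hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory \
>     -D hadoop.socks.server=localhost:6666 \
>     -ls /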
>
> -- Philip
>
