Btw - I figured out the problem. The jobconf from the remote client had the socks proxy configuration - the JVMs spawned by the TTs picked this up and tried to connect through the proxy, which of course didn't work.
This was easy to solve - I just had to make the remote initialization script mark hadoop.rpc.socket.factory.class.default as a final variable in the hadoop-site.xml on the server side. I am assuming this would be a good thing to do in general (I can't see why server-side traffic would ever need to be routed through a proxy!). Filed https://issues.apache.org/jira/browse/HADOOP-5839 to follow up on the issues uncovered here.

-----Original Message-----
From: Tom White [mailto:t...@cloudera.com]
Sent: Thursday, May 14, 2009 7:07 AM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

Yes, you're absolutely right.

Tom

On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma <jssa...@facebook.com> wrote:
> The ec2 documentation points to the use of public 'ip' addresses - whereas
> using public 'hostnames' seems safe, since they resolve to internal addresses
> from within the cluster (and to public ip addresses from outside).
>
> The only data transfer cost I would incur while submitting jobs from outside
> is the cost of copying the jar files and any other files meant for the
> distributed cache. That would be extremely small.
>
> -----Original Message-----
> From: Tom White [mailto:t...@cloudera.com]
> Sent: Thursday, May 14, 2009 5:58 AM
> To: core-user@hadoop.apache.org
> Subject: Re: public IP for datanode on EC2
>
> Hi Joydeep,
>
> The problem you are hitting may be because port 50001 isn't open,
> whereas from within the cluster any node may talk to any other node
> (because the security groups are set up to allow this).
>
> However I'm not sure this is a good approach. Configuring Hadoop to
> use public IP addresses everywhere should work, but you have to pay
> for all data transfer between nodes (see http://aws.amazon.com/ec2/,
> "Public and Elastic IP Data Transfer"). This is going to get expensive
> fast!
>
> So to get this to work well, we would have to make changes to Hadoop
> so it was aware of both public and private addresses, and used the
> appropriate one: clients would use the public address, while daemons
> would use the private address. I haven't looked at what it would take
> to do this or how invasive it would be.
>
> Cheers,
> Tom
>
> On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma <jssa...@facebook.com> wrote:
>> I changed the ec2 scripts to have fs.default.name assigned to the public
>> hostname (instead of the private hostname).
>>
>> Now I can submit jobs remotely via the socks proxy (the problem below is
>> resolved) - but the map tasks fail with an exception:
>>
>> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect
>> to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001.
>> Already tried 9 time(s).
>> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error
>> running child
>> java.io.IOException: Call to
>> ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on
>> local exception: Connection refused
>>         at org.apache.hadoop.ipc.Client.call(Client.java:699)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>         at $Proxy1.getProtocolVersion(Unknown Source)
>>         at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>>         at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
>>         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:153)
>>
>> strangely enough - job submissions from nodes within the ec2 cluster work
>> just fine. I looked at the job.xml files of jobs submitted locally and
>> remotely and don't see any relevant differences.
>>
>> Totally foxed now.
>>
>> Joydeep
>>
>> -----Original Message-----
>> From: Joydeep Sen Sarma [mailto:jssa...@facebook.com]
>> Sent: Wednesday, May 13, 2009 9:38 PM
>> To: core-user@hadoop.apache.org
>> Cc: Tom White
>> Subject: RE: public IP for datanode on EC2
>>
>> Thanks Philip. Very helpful (and great blog post)! This seems to make basic
>> dfs command line operations work just fine.
>>
>> However - I am hitting a new error during job submission (running
>> hadoop-0.19.0):
>>
>> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver
>> (SessionState.java:printError(279)) - Job Submission failed with exception
>> 'java.net.UnknownHostException(unknown host:
>> domU-12-31-39-00-51-94.compute-1.internal)'
>> java.net.UnknownHostException: unknown host:
>> domU-12-31-39-00-51-94.compute-1.internal
>>         at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:686)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>         at $Proxy0.getProtocolVersion(Unknown Source)
>>         at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>>         at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
>>         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
>>         at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
>>         at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
>>
>> looking at the stack trace and the code - it seems that this is happening
>> because the jobclient asks for the mapred system directory from the
>> jobtracker - which replies with a path name that's qualified against
>> the fs.default.name setting of the jobtracker. Unfortunately the standard
>> EC2 scripts assign this to the internal hostname of the hadoop master.
>>
>> Is there any downside to using public hostnames instead of the private ones
>> in the ec2 starter scripts?
>>
>> Thanks for the help,
>>
>> Joydeep
>>
>> -----Original Message-----
>> From: Philip Zeyliger [mailto:phi...@cloudera.com]
>> Sent: Wednesday, May 13, 2009 2:40 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: public IP for datanode on EC2
>>
>> On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <jssa...@facebook.com> wrote:
>>> (raking up real old thread)
>>>
>>> After struggling with this issue for some time now - it seems that accessing
>>> hdfs on ec2 from outside ec2 is not possible.
>>>
>>> This is primarily because of
>>> https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports
>>> are authorized in ec2 and we set the public hostname via slave.host.name -
>>> the namenode uses the internal IP addresses of the datanodes for block
>>> locations. DFS clients outside ec2 cannot reach these addresses and report
>>> failures reading/writing data blocks.
>>>
>>> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP
>>> addresses (and use datanode host names as pre-985)?
>>>
>>> I really like the idea of being able to use an external node (my personal
>>> workstation) to do job submission (which typically requires interacting
>>> with HDFS in order to push files into the jobcache etc).
>>> This way I don't
>>> need custom AMIs - I can use stock hadoop amis (all the custom software is
>>> on the external node). Without the above option - this is not possible
>>> currently.
>>
>> You could use ssh to set up a SOCKS proxy between your machine and
>> ec2, and set org.apache.hadoop.net.SocksSocketFactory as the
>> socket factory.
>> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
>> has more information.
>>
>> -- Philip
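[For reference: the client-side setup Philip suggests is roughly the following - a sketch based on the Cloudera blog post he links, not a tested recipe. The port number 6666 and the master hostname are illustrative. First open the tunnel from the workstation, e.g. `ssh -D 6666 root@<ec2-master-public-hostname>`, then point the client's hadoop-site.xml at it:]

```xml
<!-- hadoop-site.xml on the remote client (outside EC2): route Hadoop RPC
     through the local SOCKS tunnel opened by ssh -D. -->
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:6666</value>
</property>
```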
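[For reference: the server-side fix described at the top of the thread - pinning the socket factory as final so a remote client's SOCKS setting in its jobconf cannot leak into the JVMs spawned by the TaskTrackers - would look roughly like this in the cluster's hadoop-site.xml. A sketch; StandardSocketFactory shown here is Hadoop's default value for this property:]

```xml
<!-- hadoop-site.xml on the EC2 (server) side: mark the socket factory
     final so job submissions cannot override it with a proxy setting. -->
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.StandardSocketFactory</value>
  <final>true</final>
</property>
```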