(raking up a real old thread) After struggling with this issue for some time now, it seems that accessing HDFS on EC2 from outside EC2 is not possible.
This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if the datanode ports are authorized in EC2 and we set the public hostname via slave.host.name, the namenode uses the internal IP addresses of the datanodes for block locations. DFS clients outside EC2 cannot reach these addresses and report failures reading/writing data blocks.

HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and to use datanode host names instead, as pre-985)?

I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc.). This way I don't need custom AMIs - I can use stock Hadoop AMIs (all the custom software is on the external node). Without the above option, this is not currently possible. (A sketch of what such an external client runs into is appended below, after the quoted message.)

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Tuesday, September 09, 2008 7:04 AM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

> I think most people try to avoid allowing remote access for security
> reasons. If you can add a file, I can mount your filesystem too, maybe
> even delete things. Whereas with EC2-only filesystems, your files are
> *only* exposed to everyone else that knows or can scan for your IPAddr and
> ports.

I imagine that access to the ports used by HDFS could be restricted to specific IPs using the EC2 security group (ec2-authorize) or any other firewall mechanism if necessary.

Could anyone confirm that there is no conf parameter I could use to force the address of my DataNodes?

Thanks

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
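PS: for anyone who wants to see where it breaks, here is a minimal sketch of the kind of external client I mean, not my actual code. The class name, the public namenode hostname, the port 9000 and the paths are all placeholders, and I'm assuming a 0.18-era Hadoop API with the namenode port opened to the workstation's IP via ec2-authorize.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExternalHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Public DNS name of the namenode (placeholder); the namenode port
    // is assumed to be open to this client's IP (e.g. via ec2-authorize).
    conf.set("fs.default.name",
        "hdfs://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9000");

    FileSystem fs = FileSystem.get(conf);

    // Metadata-only operations go through the namenode and work fine.
    System.out.println("exists /user: " + fs.exists(new Path("/user")));

    // Reading or writing block data means talking to the datanodes
    // directly. The namenode returns their internal EC2 addresses
    // (HADOOP-985), which are unreachable from outside EC2, so this
    // create/write eventually fails.
    FSDataOutputStream out = fs.create(new Path("/user/test/hello.txt"));
    out.writeUTF("hello from outside EC2");
    out.close();
  }
}

Setting slave.host.name to the public hostname on the datanodes does not help here: the datanodes register under that name, but the namenode still reports their internal IPs as block locations - hence the request above for a hostname-based option.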