I'd like to spread Hadoop across two physical clusters, one of which is publicly accessible and the other behind a NAT. The NAT'd machines will only run TaskTrackers, not HDFS, and not Reducers either (they're configured with 0 Reduce slots). The master node will run in the publicly accessible cluster.
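For reference, this is roughly what each NAT'd worker's mapred-site.xml looks like (property names are from the 0.20-era config; the JobTracker address and slot counts below are just placeholders):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- public address of the JobTracker in the accessible cluster -->
    <value>jobtracker.example.com:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <!-- 0 reduce slots, so these nodes only ever run map tasks -->
    <value>0</value>
  </property>
  <property>
    <name>mapred.task.tracker.http.address</name>
    <!-- HTTP server that serves map output; 50060 is the default port -->
    <value>0.0.0.0:50060</value>
  </property>
</configuration>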
Two questions:

1. Port 50060 needs to be open on all the NAT'd machines, since Reduce tasks fetch intermediate data from http://<tasktracker>:50060/mapOutput, correct? With no ports open I'm getting "Too many fetch-failures", so I assume the Reduce tasks pull the intermediate data rather than the Map tasks pushing it.

2. Although the NAT'd machines have unique IPs and can reach the outside network, DHCP is not assigning them hostnames, so when they join the JobTracker they show up as "tracker_localhost.localdomain:localhost.localdomain/127.0.0.1" on the machine list page. Is there a way to force Hadoop to refer to them by IP instead of hostname, since I don't have control over the DHCP server? I could manually assign a hostname via /etc/hosts on each NAT'd machine, but these are actually VMs, and I will have many of them receiving semi-random IPs, so that would be an ugly administrative task (see the snippet in the P.S. below for what I was considering instead).

Thanks for any input!

-Ben
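P.S. For question 2, this is what I was considering trying on each NAT'd VM instead of editing /etc/hosts. slave.host.name and mapred.tasktracker.dns.interface are the properties I found in the 0.20-era defaults, but I'm not sure either actually does what I want, and the IP and interface name below are just placeholders:

<!-- candidate addition to mapred-site.xml on a NAT'd VM; untested sketch -->
<property>
  <name>slave.host.name</name>
  <!-- report this VM's own IP to the JobTracker instead of localhost.localdomain -->
  <value>10.0.0.17</value>
</property>
<property>
  <name>mapred.tasktracker.dns.interface</name>
  <!-- or have the TaskTracker derive its name from a specific network interface -->
  <value>eth0</value>
</property>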