I'd like to spread Hadoop across two physical clusters, one of which is publicly accessible and the other behind a NAT. The NAT'd machines will only run TaskTrackers, not HDFS, and not Reducers either (they're configured with 0 Reduce slots). The master node will run in the publicly accessible cluster.
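For reference, this is roughly what each NAT'd worker's mapred-site.xml looks like (property names are from the 0.20-era config; the JobTracker address and slot counts below are just placeholders):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- public address of the JobTracker in the accessible cluster -->
    <value>jobtracker.example.com:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <!-- 0 reduce slots, so these nodes only ever run map tasks -->
    <value>0</value>
  </property>
  <property>
    <name>mapred.task.tracker.http.address</name>
    <!-- HTTP server that serves map output; 50060 is the default port -->
    <value>0.0.0.0:50060</value>
  </property>
</configuration>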
Two questions:

1. Port 50060 needs to be open on all the NAT'd machines, since Reduce tasks fetch intermediate data from http://<tasktracker>:50060/mapOutput, correct? With no ports open I'm getting "Too many fetch-failures", so I assume the Reduce tasks pull the intermediate data rather than the Map tasks pushing it.

2. Although the NAT'd machines have unique IPs and can reach the outside network, DHCP is not assigning them hostnames, so when they join the JobTracker they show up as "tracker_localhost.localdomain:localhost.localdomain/127.0.0.1" on the machine list page. Is there a way to force Hadoop to refer to them by IP instead of hostname, since I don't have control over the DHCP server? I could manually assign a hostname via /etc/hosts on each NAT'd machine, but these are actually VMs, and I will have many of them receiving semi-random IPs, so that would be an ugly administrative task (see the snippet in the P.S. below for what I was considering instead).

Thanks for any input!

-Ben
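P.S. For question 2, this is what I was considering trying on each NAT'd VM instead of editing /etc/hosts. slave.host.name and mapred.tasktracker.dns.interface are the properties I found in the 0.20-era defaults, but I'm not sure either actually does what I want, and the IP and interface name below are just placeholders:

<!-- candidate addition to mapred-site.xml on a NAT'd VM; untested sketch -->
<property>
  <name>slave.host.name</name>
  <!-- report this VM's own IP to the JobTracker instead of localhost.localdomain -->
  <value>10.0.0.17</value>
</property>
<property>
  <name>mapred.tasktracker.dns.interface</name>
  <!-- or have the TaskTracker derive its name from a specific network interface -->
  <value>eth0</value>
</property>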