Hey Hadoop'sters - I wanted to break out the DFS runtime to stand alone, in addition to not requiring ssh/rsync for operations, although those are fine choices.
Here are the results of my cannibalizing the existing scripts, etc. (not perfect): http://67.113.25.210/james/projects/hadoop/hadoop.sh

Note: if there is interest and this shell/thread is wiki-worthy and such, I can help with the details.

To start with, the simple nits/issues I encountered earlier are resolved:

- bin/hadoop.sh fails under dash, which is Ubuntu 6.10's /bin/sh, on the bash-only test

      if [ "$HADOOP_NICENESS" == "" ]; then

  (a portable rewrite is sketched below)

- it would be nice to add a ".txt" suffix to the log and out files so that they can be viewed within the browser (also sketched below)

On the downside, with this script I seem to have lost my log files, which sucks. I'm pretty sure it is a classpath issue, but it eludes me thus far.

Now, the more critical issue. I'm wondering why, if I am understanding things correctly, the DataNode *requires* DNS resolution instead of optionally just using the specified IP. As a simple case I created two loopback configurations, one specifying the NameNode and the other specifying the singular DataNode (roughly the config sketched below). The two never connected, since Hadoop appears to prefer to look up the host name and use the resulting IP, which is not loopback in my case, hence the failure to connect. I can understand why DNS resolution is a good default, but why is it mandatory, unless I am missing something? Some deployments may wish to forgo DNS and simply map to predetermined IPs; in the most trivial case, I may want to run dev nodes on loopback, emulating the scenario just mentioned.
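A few sketches of what I mean, in case they help. First, the dash failure: POSIX test only defines "=" for string comparison, so dash rejects the "==" above; a portable form would be something like the following (I'm assuming 0 is the intended default niceness):

    # dash's [ builtin rejects ==; plain = is the POSIX spelling
    if [ "$HADOOP_NICENESS" = "" ]; then
      HADOOP_NICENESS=0
    fi

    # or, equivalently, a one-line parameter-expansion default
    HADOOP_NICENESS=${HADOOP_NICENESS:-0}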
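For the ".txt" suffix, something along these lines in the daemon script is all I have in mind (the variable names are from my memory of hadoop-daemon.sh, so treat them as assumptions):

    # name the daemon output files *.txt so browsers render them inline
    log="$HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-$command-$HOSTNAME.log.txt"
    out="$HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-$command-$HOSTNAME.out.txt"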
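And to make the loopback case concrete, both of my node configs pointed at roughly the following (the property name and port are from my own setup, so treat them as assumptions, not gospel); the point is that even with a literal IP here, the DataNode still seems to register under whatever its host name resolves to:

    # write a minimal loopback site config; fs.default.name is the NameNode address
    cat > conf/hadoop-site.xml <<'EOF'
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>127.0.0.1:9000</value>
      </property>
    </configuration>
    EOF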
much appreciated,
- james