Hello, I have been working on setting up a Hadoop on Demand cluster on three machines and have run into a bit of a snag. I went through the admin and user guides and have successfully installed torque and HOD. When I run "hod allocate" it successfully starts hodring on 2 of the machines but not the third. The result is that I have a working Namenode and Jobtracker (though its UI does not seem to work presently) but no slave nodes.
Even at level 4 debug in all sections there is nothing to indicate a failure as the ringmaster has no problem communicating with the running hodring jobs and pbsdsh returns without error. I can find no logs on any of the machines indicating a torque issue (though I admit I am not terribly familiar with torque) and no logs at all for HOD on the machine that is not running hodring. It would appear that the pbsdsh job simply isn't starting hodring on the one node given the lack of any HOD log on that machine. Either it is not recognizing the node (seems somewhat unlikely as it comes up in pbsnodes as free) or there is a relatively silent failure somewhere. If you have any suggestions I would much appreciate them. One quick side note is I have successfully run a standard hadoop-0.20.0 cluster on these three machines with no difficulty, which should rule out connection, ssh or firewall issues. Thanks, Seb
