Hi Nick,

Can you curl http://127.0.0.1:50075/stacks on one of the stuck nodes and paste the result?
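If it is easier to grab the dump from several nodes in one go, a rough Python sketch along these lines might do it. The port and the /stacks path come from the URL above; the helper names (stacks_url, fetch_stacks, collect_stacks) and passing hostnames on the command line are only assumptions for illustration, not anything Hadoop ships.

```python
import sys
import urllib.error
import urllib.request

# Port taken from the URL above; adjust if your nodes use a different one.
DATANODE_HTTP_PORT = 50075


def stacks_url(host, port=DATANODE_HTTP_PORT):
    """Build the /stacks URL for one node (hypothetical helper)."""
    return f"http://{host}:{port}/stacks"


def fetch_stacks(host, timeout=10.0):
    """Return the thread dump as text, or a short error note on a network failure."""
    url = stacks_url(host)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        return f"ERROR fetching {url}: {exc}"


def collect_stacks(hosts, fetch=fetch_stacks):
    """Map each host name to its dump (or error note)."""
    return {host: fetch(host) for host in hosts}


if __name__ == "__main__":
    # Host names come from the command line; default to the local node.
    hosts = sys.argv[1:] or ["127.0.0.1"]
    for host, dump in collect_stacks(hosts).items():
        print(f"===== {host} =====")
        print(dump)
```

Run it on (or against) a stuck node and a healthy one, then compare the two dumps; a node that cannot be reached shows up as an ERROR line rather than stopping the loop.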
Sometimes that can give an indication as to where things are getting stuck.

-Todd

On Mon, Sep 28, 2009 at 7:21 PM, Nick Rathke <n...@sci.utah.edu> wrote:

> FYI I get the same hanging behavior if I follow the Hadoop quick start for
> a single-node baseline configuration (no modified conf files).
>
> -Nick
>
> Brian Bockelman wrote:
>
>> Hey Nick,
>>
>> Do you have any error messages appearing in the log files?
>>
>> Brian
>>
>> On Sep 28, 2009, at 2:06 PM, Nick Rathke wrote:
>>
>>> Ted Dunning wrote:
>>>
>>>> I think that the last time you asked this question, the suggestion was
>>>> to look at DNS and make sure that everything is exactly correct in the
>>>> net-boot configuration. Hadoop is very sensitive to network routing and
>>>> naming details.
>>>>
>>>> So,
>>>>
>>>> a) in your net-boot, how are IP addresses assigned?
>>>
>>> We assign static IPs based on a node's MAC address via DHCP so that
>>> when a node is netbooted or booted with a local OS it gets the same IP
>>> and hostname.
>>>
>>>> b) how are DNS names propagated?
>>>
>>> Cluster DNS names are mixed in with our facility DNS servers.
>>> All nodes have proper forward and reverse DNS lookups.
>>>
>>>> c) how have you guaranteed that (a) and (b) are exactly consistent?
>>>
>>> Host MAC address. I have also manually confirmed this.
>>>
>>>> d) how have you guaranteed that every node can talk to every other node
>>>> both by name and IP address?
>>>
>>> Local cluster DNS / DHCP + all nodes have all other nodes' host names
>>> and IPs in /etc/hosts. I have compared all the config files for DNS,
>>> DHCP, and /etc/hosts to make sure all information is the same.
>>>
>>>> e) have you assured yourself that any reverse mapping that exists is
>>>> correct?
>>>
>>> Yes, and tested.
>>>
>>> One more bit of information. The system boots on a 1Gb network; all
>>> other network traffic, i.e. MPI and NFS to data volumes, is via IB.
>>>
>>> The IB network also has proper forward/reverse DNS entries. IB IP
>>> addresses are set up at boot time via a script that takes the host IP
>>> and a fixed offset to calculate the address for the IB interface. I have
>>> also confirmed that the IB IP addresses match our DNS.
>>>
>>> -Nick
>>>
>>>> On Mon, Sep 28, 2009 at 9:45 AM, Nick Rathke <n...@sci.utah.edu> wrote:
>>>>
>>>>> I am hoping that someone can help with this issue. I have a 64-node
>>>>> cluster that we would like to run Hadoop on; most of the nodes are
>>>>> netbooted via NFS.
>>>>>
>>>>> Hadoop runs fine on nodes IF the node uses a local OS install, but
>>>>> doesn't work when nodes are netbooted. Under netboot I can see that
>>>>> the slaves have the correct Java processes running, but the Hadoop web
>>>>> pages never show the nodes as available. The Hadoop logs on the nodes
>>>>> also show that everything is running and started up correctly.
>>>>>
>>>>> On the few nodes that have a local OS installed everything works just
>>>>> fine and I can run the test jobs without issue (so far).
>>>>>
>>>>> I am using the identical Hadoop install and configuration between
>>>>> netbooted nodes and non-netbooted nodes.
>>>>>
>>>>> Has anyone encountered this type of issue?
>>>
>>> --
>>> Nick Rathke
>>> Scientific Computing and Imaging Institute
>>> Sr. Systems Administrator
>>> n...@sci.utah.edu
>>> www.sci.utah.edu
>>> 801-587-9933
>>> 801-557-3832
>>>
>>> "I came I saw I made it possible" Royal Bliss - Here They Come