Hi there,

I'm a new user of Hadoop and Nutch, and I am trying to run the *Nutch* crawler on a distributed system powered by *Hadoop*. As it turns out, however, the cluster does not recognise any of the slave nodes. I've been stuck at this point for months and am desperate for a solution. I would appreciate it if anyone would be kind enough to spend 10 minutes of their valuable time to help.
Thank you so much!! This is what I am currently encountering:

==================================

To set up the Hadoop cluster, I followed the instructions described in both of:
http://wiki.apache.org/nutch/NutchHadoopTutorial
http://hadoop.apache.org/common/docs/current/cluster_setup.html

The problem is this: with a distributed file system (*HDFS* in Hadoop), the data should be stored across both computers. Instead, all data in HDFS, which is supposed to be replicated or distributed to every computer in the cluster, is found only on the master node. Nothing is replicated to the other slave nodes, which causes subsequent tasks such as the *jobtracker* to fail. I've attached a jobtracker log file.

Everything worked fine while there was only one computer (the master node) in the cluster and all data was stored on it. The problem arises when the program tries to write files onto another computer (a slave node). The weird part is that HDFS can create folders on the slave nodes but not files, so the HDFS folders on the slave nodes are all empty.

The web interfaces (http://masterNode:50070 and http://masterNode:50030), which show the status of HDFS and the jobtracker, indicate that there is only one active node (the master node); none of the slave nodes are recognised. (A command-line sketch of the same check is in the P.P.S. at the end of this message.)

I use Nutch 1.2 and Hadoop 0.20 in this experiment. Here is what I have done so far:

- I followed the instructions in the documentation mentioned above.
- I created users with an identical username on multiple computers on the same local network, all running Ubuntu 10.10.
- I set up passphrase-less ssh keys for all computers, and every node in the cluster can *ssh* to every other node without being asked for a password (the commands are sketched in the P.P.S.).
- I shut down the firewall with "*sudo ufw disable*".
- I have searched for solutions on the Internet, but with no luck so far.

I appreciate any help. The Hadoop configuration files (*core-site.xml* <http://db.tt/co0q25s>, *hdfs-site.xml* <http://db.tt/TSK7jA6>, *mapred-site.xml* <http://db.tt/8dJoUrp>, and *hadoop-env.sh* <http://db.tt/FztxTEw>) and the log file with the error message (*hadoop-rui-jobtracker-ss2.log* <http://db.tt/PPGhEaa>) are linked; an outline of the relevant settings, with placeholder hostnames, is in the P.P.S. as well.

p.s.: Re: Harsh J: Thank you so much for your time and reply; I've uploaded the configuration and log files as links. The *HADOOP_HOME* directory (i.e., */home/rui/workspace/nutch/search/*) is where *bin/*, *conf/*, *lib/*, etc. are located, and *start-all.sh* is at *${HADOOP_HOME}/bin/start-all.sh*. There is no separate directory for Hadoop; I believe it is integrated into Nutch.

==================================

Regards,
Andy
The University of Melbourne
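
P.P.S.: In case the links above ever go stale, here is a rough outline of what the relevant configuration boils down to. The hostnames and port numbers are placeholders and the snippet is reconstructed from the tutorials rather than copied from my machine, so the linked files remain the authoritative versions:

# Run from HADOOP_HOME, i.e. /home/rui/workspace/nutch/search/
# "masterNode" and "slaveNode" are placeholder hostnames; the ports are the
# usual examples from the tutorials.

# conf/masters: the host that runs the secondary namenode.
echo "masterNode" > conf/masters

# conf/slaves: every host that should run a datanode and tasktracker,
# one hostname per line.
printf "masterNode\nslaveNode\n" > conf/slaves

# conf/core-site.xml: fs.default.name must point at the master's real
# hostname (not localhost).
cat > conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://masterNode:9000</value>
  </property>
</configuration>
EOF

# The other two files follow the same pattern:
#   conf/mapred-site.xml  ->  mapred.job.tracker = masterNode:9001
#   conf/hdfs-site.xml    ->  dfs.replication    = 2

The reason fs.default.name and mapred.job.tracker are set to the master's hostname is that the slave daemons use exactly these values to find the namenode and jobtracker.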
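The passphrase-less SSH setup mentioned in the list above was, as far as I remember, the standard recipe below; "slaveNode" again stands for the real hostname of a slave:

# Generate a key with an empty passphrase (done as user "rui").
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Authorise the key for localhost logins on this machine ...
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# ... and copy it to each of the other machines in the cluster.
ssh-copy-id rui@slaveNode

# Sanity check: neither of these should ask for a password.
ssh localhost true
ssh slaveNode true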
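Finally, in case it helps, the same information that the web interface shows can also be checked from the command line; this is only a sketch, again with placeholder hostnames:

cd /home/rui/workspace/nutch/search    # HADOOP_HOME

bin/start-all.sh              # starts the HDFS and MapReduce daemons

jps                           # on the master: should list at least NameNode
                              # and JobTracker (jps ships with the JDK)
ssh slaveNode jps             # on a slave: should list DataNode and TaskTracker

bin/hadoop dfsadmin -report   # prints the number of live datanodes and the
                              # capacity each of them reports

# The datanode log on the slave is the first place to look when a slave
# does not show up in the report:
ssh slaveNode 'tail -n 50 /home/rui/workspace/nutch/search/logs/hadoop-rui-datanode-*.log'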