= How to set up Nutch and Hadoop on Ubuntu 6.06 =

== Prerequisites ==
To make use of a real distributed file system, Hadoop needs at least two computers; it can also run on a single machine, but then the distributed capabilities go unused. Nutch is written in Java, so a Java compiler and runtime are needed, as well as ant. Hadoop makes use of ssh clients and servers on all machines. Lucene needs a servlet container; I used tomcat5.

Become root, edit /etc/apt/sources.list to enable the universe and multiverse repositories, and install the packages:
{{{
su
apt-get install sun-java5
apt-get install openssh
apt-get install tomcat5
}}}
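As a quick optional check (not part of the original steps), each tool can be asked for its version before continuing:
{{{
java -version
ant -version
ssh -V
}}}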
== Build nutch ==
Download Nutch; this includes Hadoop and Lucene. I used the latest nightly build, which at the time of writing was 2007-02-06: [http://cvs.apache.org/dist/lucene/nutch/nightly/ Nutch nightly]. Unpack the tarball to nutch-nightly and build it with ant.
{{{
tar -xvzf nutch-2007-02-05.tar.gz
cd nutch-nightly
mkdir ~/nutch-build
echo "~/nutch-build" >> build.properties
ant package
}}}

== Prepare the machines ==
Create the nutch user on each machine and create the necessary directories for nutch. (This listing was abridged in the source; the two extra mkdir lines and the useradd line are a reconstruction based on the directories used later.)
{{{
su
export NUTCH_INSTALL_DIR=/nutch-0.9.0
mkdir ${NUTCH_INSTALL_DIR}
mkdir ${NUTCH_INSTALL_DIR}/search
mkdir ${NUTCH_INSTALL_DIR}/home
useradd -d ${NUTCH_INSTALL_DIR}/home -g users nutch
chown -R nutch:users ${NUTCH_INSTALL_DIR}
exit
}}}

== Install and configure nutch ==
Install nutch on the master:
{{{
export NUTCH_INSTALL_DIR=/nutch-0.9.0
cp -Rv ~/nutch-build/* ${NUTCH_INSTALL_DIR}/search/
chown -R nutch:users ${NUTCH_INSTALL_DIR}
}}}

Edit the hadoop-env.sh shell script so that the following variables are set:
{{{
ssh [EMAIL PROTECTED]
echo "export HADOOP_HOME="${NUTCH_INSTALL_DIR}"/search" >> ${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
echo "export HADOOP_LOG_DIR=\${HADOOP_HOME}/logs" >> ${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
echo "export HADOOP_SLAVES=\${HADOOP_HOME}/conf/slaves" >> ${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
exit
}}}

Create ssh keys so that the nutch user can log in over ssh without being prompted for a password:
{{{
ssh [EMAIL PROTECTED]
cd ${NUTCH_INSTALL_DIR}/home
ssh-keygen -t rsa   (use empty responses for each prompt)
...
Your public key has been saved in ${NUTCH_INSTALL_DIR}/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc [EMAIL PROTECTED]
}}}

Copy the key for this machine to the authorized_keys file that will be copied to the other machines (the slaves):
{{{
cd ${NUTCH_INSTALL_DIR}/home/.ssh
cp id_rsa.pub authorized_keys
}}}

Edit the hadoop-site.xml configuration file:
{{{
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
...
  </property>
</configuration>
}}}
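The property list itself was elided above. For reference, a minimal hadoop-site.xml for a small Hadoop 0.10 cluster might look like the sketch below; the host name master, the ports, and the filesystem paths are examples, not values from the original page:
{{{
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- where the namenode listens; "master" and the ports are example values -->
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
  </property>
  <!-- where the jobtracker listens -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
  <!-- local paths for namenode metadata and datanode blocks (example paths) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/nutch-0.9.0/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/nutch-0.9.0/filesystem/data</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/nutch-0.9.0/filesystem/mapreduce/local</value>
  </property>
  <!-- with two machines, keep two copies of each block -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
}}}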
Edit the nutch-site.xml file:
{{{
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
...
    </description>
  </property>
</configuration>
}}}

Edit the crawl-urlfilter.txt file to set the pattern of the URLs that have to be fetched:
{{{
cd ${NUTCH_INSTALL_DIR}/search
vi conf/crawl-urlfilter.txt

change the line that reads:  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to read:                     +^http://([a-z0-9]*\.)*org/
}}}

Or, if downloading the whole internet is desired, edit the nutch-site.xml file so that it includes the following property:
{{{
<property>
  <name>urlfilter.regex.file</name>
  <value>automaton-urlfilter.txt</value>
</property>
}}}

== Distribute the code and the configuration ==
Copy the code and the configuration to the slaves:
{{{
scp -r ${NUTCH_INSTALL_DIR}/search/* [EMAIL PROTECTED]:${NUTCH_INSTALL_DIR}/search
}}}

Copy the keys to the slave machines:
{{{
scp ${NUTCH_INSTALL_DIR}/home/.ssh/authorized_keys [EMAIL PROTECTED]:${NUTCH_INSTALL_DIR}/home/.ssh/authorized_keys
}}}

Check if sshd is ready on the machines (substitute each slave's host name):
{{{
ssh <slave> hostname
}}}

== Start Hadoop ==
Format the namenode:
{{{
bin/hadoop namenode -format
}}}

Start all services on all machines:
{{{
bin/start-all.sh
}}}

To stop all of the servers, use the following command:
{{{
bin/stop-all.sh
}}}

== Crawling ==
To start crawling from a few URLs as seeds, make a urls directory in which a seed file is put with some seed URLs, and put this directory into HDFS. To check that HDFS has stored the directory, use the dfs -ls option of hadoop.
{{{
mkdir urls
echo "http://lucene.apache.org" >> urls/seed
echo "http://nl.wikipedia.org" >> urls/seed
echo "http://en.wikipedia.org" >> urls/seed
bin/hadoop dfs -put urls urls
bin/hadoop dfs -ls urls
}}}

Start to crawl:
{{{
bin/nutch crawl urls -dir crawled01 -depth 3
}}}

On the master node the progress and status can be viewed with a web browser: [http://localhost:50030/ http://localhost:50030/]
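When the crawl finishes, its output can be inspected in HDFS the same way as the seed directory; crawled01 is the directory name passed to the crawl command above:
{{{
bin/hadoop dfs -ls crawled01
}}}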
== Searching ==
To search the collected web pages, the data that is now in HDFS is best copied to the local filesystem for better performance. If an index becomes too large for one machine to handle, it can be split so that separate machines each handle a part of the index. First we try to perform a search on one machine. Because searching needs different Nutch settings than crawling, the easiest thing to do is to make a separate folder for the Nutch search part.
{{{
su
export SEARCH_INSTALL_DIR=/nutch-search-0.9.0
mkdir ${SEARCH_INSTALL_DIR}
cp -Rv ${NUTCH_INSTALL_DIR}/search ${SEARCH_INSTALL_DIR}/search
mkdir ${SEARCH_INSTALL_DIR}/local
mkdir ${SEARCH_INSTALL_DIR}/home
}}}

Copy the data:
{{{
bin/hadoop dfs -copyToLocal crawled01 ${SEARCH_INSTALL_DIR}/local/
}}}

Edit the nutch-site.xml in the nutch search directory:
{{{
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
...
  </property>
</configuration>
}}}

Edit the hadoop-site.xml file and delete all the properties:
{{{
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
</configuration>
}}}

Test if all is configured properly:
{{{
bin/nutch org.apache.nutch.searcher.NutchBean an
}}}

The last command should return a number of hits. If the query results in 0 hits, there could be something wrong with the configuration or with the index, or there are simply no documents containing the word. Try a few words; if all of them result in 0 hits, most probably the configuration is wrong or the index is corrupt. The configuration problems I came across were pointing to the wrong index directory and unintentionally using Hadoop.

Copy the war file to the tomcat directory:
{{{
rm -rf /usr/share/tomcat5/webapps/ROOT*
cp ${SEARCH_INSTALL_DIR}/*.war /usr/share/tomcat5/webapps/ROOT.war
}}}

Copy the configuration to the tomcat directory:
{{{
cp ${SEARCH_INSTALL_DIR}/search/conf/* /usr/share/tomcat5/webapps/ROOT/WEB-INF/classes/
}}}

Start tomcat:
{{{
/usr/share/tomcat5/bin/startup.sh
}}}

Open the search page in a web browser: [http://localhost:8180/ http://localhost:8180/]

== Distributed searching ==
Copy the search install directory to the other machines:
{{{
scp -r ${SEARCH_INSTALL_DIR}/search [EMAIL PROTECTED]:${SEARCH_INSTALL_DIR}/search
}}}

Edit the nutch-site.xml so that the searcher.dir property points to a directory containing a search-servers.txt file with a list of IP addresses and ports. Edit the search-servers.txt file:
{{{
x.x.x.1 9999
x.x.x.2 9999
x.x.x.3 9999
}}}

Start up the search service:
{{{
bin/nutch server 9999 ${SEARCH_INSTALL_DIR}/local/crawled01
}}}
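For reference, the searcher.dir property in nutch-site.xml could then be set along these lines; the path is only an example and should be whatever directory holds the search-servers.txt file:
{{{
<property>
  <name>searcher.dir</name>
  <!-- example path: any directory containing search-servers.txt -->
  <value>/nutch-search-0.9.0/search/conf</value>
</property>
}}}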