Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by mozdevil:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

------------------------------------------------------------------------------
  = How to setup Nutch and Hadoop on Ubuntu 6.06 =
  
- <h2>Prerequisites</h2>
+ == Prerequisites ==
  To run Hadoop one needs at least two computers to make use of a real distributed file system. It can also run on a single machine, but then no use is made of the distributed capabilities.
  
  Nutch is written in Java, so the Java compiler and runtime are needed, as well as ant. Hadoop makes use of ssh clients and servers on all machines. Lucene needs a servlet container; I used tomcat5.
- <pre>
+ {{{
  su
  # edit /etc/apt/sources.list to enable the universe and multiverse repositories
  apt-get install sun-java5
  apt-get install openssh
  apt-get install tomcat5
- </pre>
+ }}}
  
- <h2>Build nutch</h2>
+ == Build nutch ==
  Download Nutch; this includes Hadoop and Lucene. I used the latest nightly build, which at the time of writing was 2007-02-06.
  [http://cvs.apache.org/dist/lucene/nutch/nightly/ Nutch nightly]
  
  Unpack the tarball to nutch-nightly and build it with ant.
- <pre>
+ {{{
  tar -xvzf nutch-2007-02-05.tar.gz
  cd nutch-nightly
  mkdir ~/nutch-build
  echo "~/nutch-build" >> build.properties
  ant package
- </pre>
+ }}}
  
- <h2>Prepare the machines</h2>
+ == Prepare the machines ==
  Create the nutch user on each machine and create the necessary directories for nutch.
- <pre>
+ {{{
  su
  export NUTCH_INSTALL_DIR=/nutch-0.9.0
  mkdir ${NUTCH_INSTALL_DIR}
@@ -44, +43 @@

  
  chown -R nutch:users ${NUTCH_INSTALL_DIR}
  exit
- </pre>
+ }}}
  
- <h2>Install and configure nutch</h2>
+ == Install and configure nutch ==
  Install nutch on the master
- <pre>
+ {{{
  export NUTCH_INSTALL_DIR=/nutch-0.9.0
  cp -Rv ~/nutch-build/* ${NUTCH_INSTALL_DIR}/search/
  chown -R nutch:users ${NUTCH_INSTALL_DIR}
- </pre>
+ }}}
  
  Edit the hadoop-env.sh shell script so that the following variables are set.
- <pre>
+ {{{
  ssh [EMAIL PROTECTED]
  
  echo "export HADOOP_HOME="${NUTCH_INSTALL_DIR}"/search" >> 
${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
@@ -63, +62 @@

  echo "export HADOOP_LOG_DIR=\${HADOOP_HOME}/logs" >> 
${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
  echo "export HADOOP_SLAVES=\${HADOOP_HOME}/conf/slaves" >> 
${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
  exit
- </pre>
+ }}}
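  If JAVA_HOME is not already set system-wide, hadoop-env.sh also needs it so that Hadoop can find the JVM. A minimal sketch, assuming the Sun Java 5 packages installed earlier end up under /usr/lib/jvm/java-1.5.0-sun (that path is an assumption):
  {{{
  # assumption: JVM path for the sun-java5 packages on Ubuntu 6.06
  echo "export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun" >> ${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
  }}}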
  
  Create ssh keys so that the nutch user can log in over ssh without being prompted for a password.
- <pre>
+ {{{
  ssh [EMAIL PROTECTED]
  cd ${NUTCH_INSTALL_DIR}/home
  ssh-keygen -t rsa (Use empty responses for each prompt)
@@ -76, +75 @@

    Your public key has been saved in ${NUTCH_INSTALL_DIR}/home/.ssh/id_rsa.pub.
    The key fingerprint is:
    a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc [EMAIL PROTECTED]
- </pre>
+ }}}
  
  Copy the key for this machine to the authorized_keys file that will be copied 
to the other machines (the slaves).
- <pre>
+ {{{
  cd ${NUTCH_INSTALL_DIR}/home/.ssh
  cp id_rsa.pub authorized_keys
- </pre>
+ }}}
  
  Edit the hadoop-site.xml configuration file.
- <pre>
+ {{{
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  
  <!-- Put site-specific property overrides in this file. -->
@@ -153, +152 @@

  </property>
  
  </configuration>
- </pre>
+ }}}
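  For reference, a minimal sketch of the kind of entries hadoop-site.xml typically needs for a small cluster like this; the hostname "master", the ports, and the replication value below are assumptions, not values taken from this page:
  {{{
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>   <!-- assumed namenode host:port -->
  </property>
  
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>   <!-- assumed jobtracker host:port -->
  </property>
  
  <property>
    <name>dfs.replication</name>
    <value>2</value>             <!-- assumed number of block replicas -->
  </property>
  }}}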
  
  Edit the nutch-site.xml file
- <pre>
+ {{{
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  
@@ -209, +208 @@

    </description>
  </property>
  </configuration>
- </pre>
+ }}}
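  The full property list depends on the site; one thing a crawl needs in this version of Nutch is an HTTP agent name. A sketch with placeholder values:
  {{{
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>              <!-- placeholder: name your crawler -->
  </property>
  
  <property>
    <name>http.agent.description</name>
    <value>Test crawler for this tutorial</value>
  </property>
  
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.org/</value>    <!-- placeholder -->
  </property>
  
  <property>
    <name>http.agent.email</name>
    <value>nutch@example.org</value>          <!-- placeholder -->
  </property>
  }}}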
  
  Edit the crawl-urlfilter.txt file to set the pattern of the URLs that have to be fetched.
- <pre>
+ {{{
  cd ${NUTCH_INSTALL_DIR}/search
  vi conf/crawl-urlfilter.txt
  
  change the line that reads:   +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  to read:                      +^http://([a-z0-9]*\.)*org/
- </pre>
+ }}}
  
  Or, if downloading the whole internet is desired, edit the nutch-site.xml file so that it includes the following property.
- <pre>
+ {{{
  <property>
    <name>urlfilter.regex.file</name>
    <value>automaton-urlfilter.txt</value>
  </property>
- </pre>
+ }}}
  
- 
- <h2>Distribute the code and the configuration</h2>
+ == Distribute the code and the configuration ==
  Copy the code and the configuration to the slaves
- <pre>
+ {{{
  scp -r ${NUTCH_INSTALL_DIR}/search/* [EMAIL 
PROTECTED]:${NUTCH_INSTALL_DIR}/search
- </pre>
+ }}}
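  The conf/slaves file referenced by HADOOP_SLAVES above should list the machines that will run the slave daemons, one hostname per line. A sketch (the hostnames are placeholders, not from this setup):
  {{{
  slave1.example.org
  slave2.example.org
  }}}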
  
  Copy the keys to the slave machines
- <pre>
+ {{{
  scp ${NUTCH_INSTALL_DIR}/home/.ssh/authorized_keys [EMAIL 
PROTECTED]:${NUTCH_INSTALL_DIR}/home/.ssh/authorized_keys
- </pre>
+ }}}
  
  Check if sshd is ready on the machines.
- <pre>
+ {{{
  ssh ???
  hostname
- </pre>
+ }}}
  
- <h2>Start Hadoop</h2>
+ == Start Hadoop ==
  Format the namenode
- <pre>
+ {{{
  bin/hadoop namenode -format
- </pre>
+ }}}
  
  Start all services on all machines.
- <pre>
+ {{{
  bin/start-all.sh
- </pre>
+ }}}
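  As an optional sanity check, you can list the DFS root and glance at the logs to see whether the daemons came up:
  {{{
  bin/hadoop dfs -ls /
  ls ${NUTCH_INSTALL_DIR}/search/logs
  }}}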
  
  To stop all of the servers you would use the following command:
- <pre>
+ {{{
  bin/stop-all.sh
- </pre>
+ }}}
  
- <h2>Crawling</h2>
+ == Crawling ==
  To start crawling from a few URLs as seeds, a urls directory is made in which a seed file is put with some seed URLs. This file is put into HDFS; to check whether HDFS has stored the directory, use the dfs -ls option of hadoop.
- <pre>
+ {{{
  mkdir urls
  echo "http://lucene.apache.org"; >> urls/seed
  echo "http://nl.wikipedia.org"; >> urls/seed
  echo "http://en.wikipedia.org"; >> urls/seed
  bin/hadoop dfs -put urls urls
  bin/hadoop dfs -ls urls
- </pre>
+ }}}
  
  Start to crawl
- <pre>
+ {{{
  bin/nutch crawl urls -dir crawled01 -depth 3
- </pre>
+ }}}
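  The -depth argument controls how many fetch rounds are run. To also cap the number of pages fetched per round, the crawl command takes a -topN argument; for example (the value is arbitrary):
  {{{
  bin/nutch crawl urls -dir crawled01 -depth 3 -topN 1000
  }}}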
  
  On the master node the progress and status can be viewed with a web browser.
  [[http://localhost:50030/ http://localhost:50030/]]
  
- <h2>Searching</h2>
+ == Searching ==
  To search in the collected web pages, the data that is now on HDFS is best copied to the local filesystem for better performance. If an index becomes too large for one machine to handle, the index can be split so that separate machines each handle a part of the index. First we try to perform a search on one machine.
  
  Because searching needs different settings for nutch than crawling does, the easiest thing to do is to make a separate folder for the nutch search part.
- <pre>
+ {{{
  su
  export SEARCH_INSTALL_DIR=/nutch-search-0.9.0
  mkdir ${SEARCH_INSTALL_DIR}
@@ -295, +293 @@

  cp -Rv ${NUTCH_INSTALL_DIR}/search ${SEARCH_INSTALL_DIR}/search
  mkdir ${SEARCH_INSTALL_DIR}/local
  mkdir ${SEARCH_INSTALL_DIR}/home
- </pre>
+ }}}
  
  Copy the data 
- <pre>
+ {{{
  bin/hadoop dfs -copyToLocal crawled01 ${SEARCH_INSTALL_DIR}/local/
- </pre>
+ }}}
  
- Edit the nutch-site.xml in the nutch search directory <pre>
+ Edit the nutch-site.xml in the nutch search directory 
+ {{{
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  
@@ -321, +320 @@

    </property>
  
  </configuration>
- </pre>
+ }}}
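  At minimum, the searcher.dir property has to point at the crawl data that was just copied to the local filesystem. A sketch using the directories from this tutorial:
  {{{
  <property>
    <name>searcher.dir</name>
    <value>/nutch-search-0.9.0/local/crawled01</value>
  </property>
  }}}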
  
  Edit the hadoop-site.xml file and delete all the properties
- <pre>
+ {{{
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  
@@ -333, +332 @@

  <configuration>
  
  </configuration>
- </pre>
+ }}}
  
  Test if all is configured properly
- <pre>
+ {{{
  bin/nutch org.apache.nutch.searcher.NutchBean an
- </pre>
+ }}}
  The last command should give a number of hits. If the query results in 0 hits, there could be something wrong with the configuration or the index, or there may simply be no documents containing the word. Try a few words; if all of them result in 0 hits, most probably the configuration is wrong or the index is corrupt. The configuration problems I came across were pointing to the wrong index directory and unintentionally using hadoop.
  
  Copy the war file to the tomcat directory
- <pre>
+ {{{
  rm -rf /usr/share/tomcat5/webapps/ROOT*
  cp ${SEARCH_INSTALL_DIR}/*.war /usr/share/tomcat5/webapps/ROOT.war
- </pre>
+ }}}
  
  Copy the configuration to the tomcat directory
- <pre>
+ {{{
  cp ${SEARCH_INSTALL_DIR}/search/conf/* 
/usr/share/tomcat5/webapps/ROOT/WEB-INF/classes/
- </pre>
+ }}}
  
  Start tomcat 
- <pre>
+ {{{
  /usr/share/tomcat5/bin/startup.sh
- </pre>
+ }}}
  
  Open the search page in a web browser
  [[http://localhost:8180/ http://localhost:8180/]]
  
- <h2>Distributed searching</h2>
+ == Distributed searching ==
  Copy the search install directory to other machines.
- <pre>
+ {{{
  scp -r ${SEARCH_INSTALL_DIR}/search [EMAIL PROTECTED]:${SEARCH_INSTALL_DIR}/search
- </pre>
+ }}}
  
  Edit the nutch-site.xml so that the searcher.dir property points to a directory containing a search-servers.txt file with a list of IP addresses and ports.
  Edit the search-servers.txt file
- <pre>
+ {{{
  x.x.x.1 9999
  x.x.x.2 9999
  x.x.x.3 9999
- </pre>
+ }}}
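  For reference, the searcher.dir change for the distributed case could then look like this, assuming search-servers.txt is kept in the search conf directory (that location is an assumption):
  {{{
  <property>
    <name>searcher.dir</name>
    <value>/nutch-search-0.9.0/search/conf</value>
  </property>
  }}}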
  
  Start up the search service
- <pre>
+ {{{
  bin/nutch server 9999 ${SEARCH_INSTALL_DIR}/local/crawled01
- </pre>
+ }}}
  
