Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by mozdevil:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

------------------------------------------------------------------------------
  
  {{{
  su
- edit /etc/apt/sources.list to enable the universe and multiverse repositories.
+ #enable the universe and multiverse repositories.
+ vi /etc/apt/sources.list 
  apt-get install sun-java5-jdk
  apt-get install openssh-server
  apt-get install tomcat5
@@ -22, +23 @@

  
  Unpack the tarball to nutch-nightly and build it with ant.
  {{{
+ export NUTCH_BUILD_DIR=~/nutch-build
- tar -xvzf nutch-2007-02-05.tar.gz
+ tar -xvzf nutch-2007-02-06.tar.gz
  cd nutch-nightly
- mkdir ~/nutch-build
- echo "~/nutch-build" >> build.properties
+ mkdir ${NUTCH_BUILD_DIR}
+ echo ${NUTCH_BUILD_DIR} >> build.properties
  ant package
  }}}
  
+ == Setup ==
- == Prepare the machines ==
+ === Prepare the machines ===
- Create the nutch user on each machine and create the necesarry directories for nutch
+ Create the nutch user on each machine and create the necessary directories for nutch
  {{{
- su
+ ssh [EMAIL PROTECTED]
  export NUTCH_INSTALL_DIR=/nutch-0.9.0
  mkdir ${NUTCH_INSTALL_DIR}
  mkdir ${NUTCH_INSTALL_DIR}/search
@@ -48, +51 @@

  exit
  }}}
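  The command that actually creates the nutch user is outside the diff context above; on a Debian/Ubuntu machine it could look roughly like the sketch below (the group and shell are assumptions, chosen to match the chown nutch:users calls later on).
  {{{
  # hypothetical example - create the nutch user with a home directory
  useradd -m -g users -s /bin/bash nutch
  passwd nutch
  }}}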
  
- == Install and configure nutch ==
+ === Install and configure nutch and hadoop ===
- Install nutch on the master
+ Install nutch on the namenode (the master) and add the following variables to the hadoop-env.sh shell script.
  {{{
+ ssh [EMAIL PROTECTED]
  export NUTCH_INSTALL_DIR=/nutch-0.9.0
+ export NUTCH_BUILD_DIR=~/nutch-build
- cp -Rv ~/nutch-build/* ${NUTCH_INSTALL_DIR}/search/
+ cp -Rv ${NUTCH_BUILD_DIR}/* ${NUTCH_INSTALL_DIR}/search/
- chown -R nutch:users ${NUTCH_INSTALL_DIR}
+ #chown -R nutch:users ${NUTCH_INSTALL_DIR}
- }}}
- 
- Edit the hadoop-env.sh shell script so that the following variables are set.
- {{{
- ssh [EMAIL PROTECTED]
  
  echo "export HADOOP_HOME="${NUTCH_INSTALL_DIR}"/search" >> 
${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
  echo "export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun" >> 
${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
  echo "export HADOOP_LOG_DIR=\${HADOOP_HOME}/logs" >> 
${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
  echo "export HADOOP_SLAVES=\${HADOOP_HOME}/conf/slaves" >> 
${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
+ 
  exit
  }}}
  
+ === Configure SSH ===
  Create ssh keys so that the nutch user can log in over ssh without being prompted for a password.
  {{{
  ssh [EMAIL PROTECTED]
@@ -86, +87 @@

  cp id_rsa.pub authorized_keys
  }}}
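  The key generation itself is not visible in the hunk above; a typical passwordless setup for the nutch user might look like the following sketch (RSA keys in the default location; the slave hostname is a placeholder).
  {{{
  # hypothetical example - generate a key pair without a passphrase
  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
  # distribute ~/.ssh (including authorized_keys) to the other nodes
  scp -r ~/.ssh nutch@slave1:~/
  }}}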
  
+ === Configure Hadoop ===
  Edit the hadoop-site.xml configuration file.
  {{{
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
@@ -157, +159 @@

  </configuration>
  }}}
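  Most of this file is elided in the diff above; for orientation, a minimal hadoop-site.xml for a setup like this typically overrides at least the properties below (the hostnames, ports and replication factor are placeholders, not values from the original page).
  {{{
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  }}}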
  
- Edit the nutch-site.xml file
+ === Configure Nutch ===
+ Edit the nutch-site.xml file. Take the contents below and fill in the value tags.
  {{{
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
@@ -167, +170 @@

  <configuration>
  <property>
    <name>http.agent.name</name>
-   <value>heeii</value>
+   <value></value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
    please set this to a single word uniquely related to your organization.
  
@@ -186, +189 @@

  
  <property>
    <name>http.agent.description</name>
-   <value>heeii.com</value>
+   <value></value>
    <description>Further description of our bot- this text is used in
    the User-Agent header.  It appears in parenthesis after the agent name.
    </description>
@@ -194, +197 @@

  
  <property>
    <name>http.agent.url</name>
-   <value>www.heeii.com</value>
+   <value></value>
    <description>A URL to advertise in the User-Agent header.  This will 
     appear in parenthesis after the agent name. Custom dictates that this
     should be a URL of a page explaining the purpose and behavior of this
@@ -204, +207 @@

  
  <property>
    <name>http.agent.email</name>
-   <value>nutch at heeii.com</value>
+   <value></value>
    <description>An email address to advertise in the HTTP 'From' request
     header and User-Agent header. A good practice is to mangle this
     address (e.g. 'info at example dot com') to avoid spamming.
@@ -230, +233 @@

  </property>
  }}}
  
- == Distribute the code and the configuration ==
+ === Distribute the code and the configuration ===
  Copy the code and the configuration to the slaves
  {{{
  scp -r ${NUTCH_INSTALL_DIR}/search/* [EMAIL PROTECTED]:${NUTCH_INSTALL_DIR}/search
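  # (hypothetical) the same copy can be repeated for every host listed in conf/slaves,
  # e.g. with a small loop; the nutch@ user below is an assumption:
  for slave in `cat ${NUTCH_INSTALL_DIR}/search/conf/slaves`; do
    scp -r ${NUTCH_INSTALL_DIR}/search/* nutch@${slave}:${NUTCH_INSTALL_DIR}/search
  done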
@@ -268, +271 @@

  {{{
  mkdir urls
  echo "http://lucene.apache.org"; >> urls/seed
- echo "http://nl.wikipedia.org"; >> urls/seed
- echo "http://en.wikipedia.org"; >> urls/seed
  bin/hadoop dfs -put urls urls
  bin/hadoop dfs -ls urls
  }}}
  
  Start to crawl
  {{{
- bin/nutch crawl urls -dir crawled01 -depth 3
+ bin/nutch crawl urls -dir crawled -depth 3
  }}}
  
  On the master node the progress and status can be viewed with a web browser.
  [[http://localhost:50030/ http://localhost:50030/]]
  
  == Searching ==
- To search in the collected webpages the data that is now on the hdfs is best copied to the local filesystem for better performance. If an index becomes to large for one machine to handle, the index can be split and sepperate machines handle a part of the index. First we try to perform a search on one machine.
+ To search the collected webpages, the data that is now on HDFS is best copied to the local filesystem for better performance. If an index becomes too large for one machine to handle, it can be split so that separate machines each handle a part of it. First we try to perform a search on one machine.
  
+ === Install nutch for searching ===
  Because the searching needs different settings for nutch than for crawling, the easiest thing to do is to make a separate folder for the nutch search part.
  {{{
- su
+ ssh [EMAIL PROTECTED]
+ export NUTCH_BUILD_DIR=~/nutch-build
  export SEARCH_INSTALL_DIR=/nutch-search-0.9.0
  mkdir ${SEARCH_INSTALL_DIR}
  chown nutch:users ${SEARCH_INSTALL_DIR}
  exit
+ 
+ ssh [EMAIL PROTECTED]
  export SEARCH_INSTALL_DIR=/nutch-search-0.9.0
- cp -Rv ${NUTCH_INSTALL_DIR}/search ${SEARCH_INSTALL_DIR}/search
+ cp -Rv ${NUTCH_BUILD_DIR}/search ${SEARCH_INSTALL_DIR}/search
  mkdir ${SEARCH_INSTALL_DIR}/local
- mkdir ${SEARCH_INSTALL_DIR}/home
  }}}
  
+ === Configure ===
- Copy the data 
- {{{
- bin/hadoop dfs -copyToLocal crawled01 ${SEARCH_INSTALL_DIR}/local/
- }}}
- 
  Edit the nutch-site.xml in the nutch search directory 
  {{{
  <?xml version="1.0"?>
@@ -319, +319 @@

  
    <property>
      <name>searcher.dir</name>
-     <value>${SEARCH_INSTALL_DIR}/local/crawled01</value>
+     <value>${SEARCH_INSTALL_DIR}/local/crawled</value>
    </property>
  
  </configuration>
@@ -337, +337 @@

  </configuration>
  }}}
  
+ === Make a local index ===
+ Copy the data from dfs to the local filesystem.
+ {{{
+ bin/hadoop dfs -copyToLocal crawled ${SEARCH_INSTALL_DIR}/local/
+ }}}
+ 
  Test if all is configured properly
  {{{
  bin/nutch org.apache.nutch.searcher.NutchBean an
  }}}
  The last command should give a number of hits. If the query results in 0 hits there could be something wrong with the configuration or the index, or there are simply no documents containing the word. Try a few words; if all of them result in 0 hits, most probably the configuration is wrong or the index is corrupt. The configuration problems I came across were pointing to the wrong index directory and unintentionally using hadoop.
  
+ === Enable the web search interface ===
  Copy the war file to the tomcat directory
  {{{
  rm -rf /usr/share/tomcat5/webapps/ROOT*
@@ -363, +370 @@

  [[http://localhost:8180/ http://localhost:8180/]]
  
  == Distributed searching ==
+ Prepare the other machines that are going to host a part of the index.
+ {{{
+ ssh [EMAIL PROTECTED]
+ export NUTCH_BUILD_DIR=~/nutch-build
+ export SEARCH_INSTALL_DIR=/nutch-search-0.9.0
+ mkdir ${SEARCH_INSTALL_DIR}
+ chown nutch:users ${SEARCH_INSTALL_DIR}
+ exit
+ }}}
+ 
  Copy the search install directory to other machines.
  {{{
  scp -r ${SEARCH_INSTALL_DIR}/search [EMAIL PROTECTED]:${SEARCH_INSTALL_DIR}/search
  }}}
  
+ === Configure ===
- Edit the nutch-site.xml so that the searcher.dir property points to a directory containing a search-servers.txt file with a list of ip adresses and ports.
+ Edit the nutch-site.xml so that the searcher.dir property points to the directory containing a search-servers.txt file with a list of IP addresses and ports.
- Edit the search-servers.txt file
+ Put the IP addresses and ports in a search-servers.txt file in the conf directory:
  {{{
  x.x.x.1 9999
  x.x.x.2 9999
  x.x.x.3 9999
  }}}
  
- Startup the search service
+ Edit the nutch-site.xml file:
  {{{
+ <?xml version="1.0"?>
+ <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+ 
+ <!-- Put site-specific property overrides in this file. -->
+ 
+ <configuration>
+ 
+   <property>
+     <name>fs.default.name</name>
+     <value>local</value>
+   </property>
+ 
+   <property>
+     <name>searcher.dir</name>
+     <value>${SEARCH_INSTALL_DIR}/search/conf/</value>
+   </property>
+ 
+ </configuration>
+ }}}
+ 
+ === Split the index ===
+ ???
+ 
+ Copy each part of the index to a different machine.
+ {{{
+ ???
+ scp -r ${SEARCH_INSTALL_DIR}/local/partX/crawled [EMAIL PROTECTED]:${SEARCH_INSTALL_DIR}/local/
+ }}}
+ 
+ === Start the services ===
+ Start up the search services on all the machines that have a part of the index.
+ {{{
- bin/nutch server 9999 ${SEARCH_INSTALL_DIR}/local/crawled01
+ bin/nutch server 9999 ${SEARCH_INSTALL_DIR}/local/crawled
  }}}
  
+ Restart the master search node
+ {{{
+ /usr/share/tomcat5/bin/shutdown.sh
+ /usr/share/tomcat5/bin/startup.sh
+ }}}
+ 
+ Open the search page in a web browser.
+ [[http://localhost:8180/ http://localhost:8180/]]
+ 
+  
  == Crawling more pages ==
  To select links from the index and crawl for other pages there are a couple of nutch commands: generate, fetch and updatedb. The following bash script combines these, so that it can be started with just two parameters: the base directory of the data and the number of pages. Save this file as e.g. bin/fetch; if the data is in crawled01 then `bin/fetch crawled01 10000' selects 10000 links from the index and fetches them.
  {{{
@@ -399, +459 @@

  
  Copy the data to the local filesystem and searching can be done on the new data.
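  The script itself is largely elided in the hunk above; a minimal sketch of such a generate/fetch/updatedb wrapper could look like the following (an illustration, not the script from the original page; how the newest segment is located depends on the exact output format of hadoop dfs -ls).
  {{{
  #!/bin/bash
  # hypothetical wrapper - usage: bin/fetch <data-dir> <topN>
  DATA_DIR=$1
  TOPN=$2
  
  # select the top scoring links from the crawldb into a new segment
  bin/nutch generate ${DATA_DIR}/crawldb ${DATA_DIR}/segments -topN ${TOPN}
  
  # the freshly generated segment is the newest entry under segments/
  SEGMENT=`bin/hadoop dfs -ls ${DATA_DIR}/segments | tail -1 | awk '{print $1}'`
  
  # fetch the selected pages and merge the results back into the crawldb
  bin/nutch fetch ${SEGMENT}
  bin/nutch updatedb ${DATA_DIR}/crawldb ${SEGMENT}
  }}}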
    
+ I noticed that the number of map and reduce tasks has an impact on the performance of Hadoop. Many times after crawling a lot of pages the nodes reported 'java.lang.OutOfMemoryError: Java heap space' errors; this happened in the indexing part as well. Increasing the number of maps solved these problems: with an index of over 200,000 pages I needed 306 maps in total over 3 machines. By setting the mapred.map.tasks property in hadoop-site.xml to 99 (much higher than what is advised in other tutorials and in the hadoop-site.xml file) that problem was solved.
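  As an illustration, the corresponding override in hadoop-site.xml could look like this (99 is the value mentioned above; tune it to your own data and cluster size).
  {{{
  <property>
    <name>mapred.map.tasks</name>
    <value>99</value>
  </property>
  }}}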
  
