Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by mozdevil:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

New page:
<h1>How to set up Nutch and Hadoop on Ubuntu 6.06</h1>

<h2>Prerequisites</h2>
To make use of a real distributed file system Hadoop needs at least two computers. 
It can also run on a single machine, but then no use is made of the distributed 
capabilities.

Nutch is written in Java, so a Java compiler and runtime are needed, as well 
as ant. Hadoop uses ssh clients and servers on all machines. Lucene 
needs a servlet container; I used tomcat5.
<pre>
su
edit /etc/apt/sources.list to enable the universe and multiverse repositories.
apt-get update
apt-get install sun-java5-jdk
apt-get install ant
apt-get install openssh-server
apt-get install tomcat5
</pre>
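
A quick way to check that the toolchain is actually in place before building (the 
exact version output will differ per system):
<pre>
java -version
ant -version
ssh -V
</pre>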

<h2>Build nutch</h2>
Download Nutch, this includes Hadoop and Lucene. I used the latest nightly 
build, which was at the time of writing 2007-02-06.
[http://cvs.apache.org/dist/lucene/nutch/nightly/ Nutch nightly]

Unpack the tarball to nutch-nightly and build it with ant.
<pre>
tar -xvzf nutch-2007-02-05.tar.gz
cd nutch-nightly
mkdir ~/nutch-build
echo "dist.dir=${HOME}/nutch-build" >> build.properties
ant package
</pre>

<h2>Prepare the machines</h2>
Create the nutch user on each machine and create the necessary directories for 
Nutch.
<pre>
su
export NUTCH_INSTALL_DIR=/nutch-0.9.0
mkdir ${NUTCH_INSTALL_DIR}
mkdir ${NUTCH_INSTALL_DIR}/search
mkdir ${NUTCH_INSTALL_DIR}/filesystem
mkdir ${NUTCH_INSTALL_DIR}/local
mkdir ${NUTCH_INSTALL_DIR}/home

groupadd users
useradd -d ${NUTCH_INSTALL_DIR}/home -g users nutch
passwd nutch

chown -R nutch:users ${NUTCH_INSTALL_DIR}
exit
</pre>

<h2>Install and configure nutch</h2>
Install nutch on the master
<pre>
export NUTCH_INSTALL_DIR=/nutch-0.9.0
cp -Rv ~/nutch-build/* ${NUTCH_INSTALL_DIR}/search/
chown -R nutch:users ${NUTCH_INSTALL_DIR}
</pre>

Edit the hadoop-env.sh shell script so that the following variables are set.
<pre>
ssh [EMAIL PROTECTED]
export NUTCH_INSTALL_DIR=/nutch-0.9.0

echo "export HADOOP_HOME=${NUTCH_INSTALL_DIR}/search" >> ${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
echo "export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun" >> ${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
echo "export HADOOP_LOG_DIR=\${HADOOP_HOME}/logs" >> ${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
echo "export HADOOP_SLAVES=\${HADOOP_HOME}/conf/slaves" >> ${NUTCH_INSTALL_DIR}/search/conf/hadoop-env.sh
exit
</pre>
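
The conf/slaves file referenced by HADOOP_SLAVES lists the machines that should run 
the DataNode and TaskTracker daemons, one hostname per line. The hostnames below are 
placeholders; use your own slave machines:
<pre>
slave01
slave02
</pre>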

Create ssh keys so that the nutch user can login over ssh without being 
prompted for a password.
<pre>
ssh [EMAIL PROTECTED]
export NUTCH_INSTALL_DIR=/nutch-0.9.0
cd ${NUTCH_INSTALL_DIR}/home
ssh-keygen -t rsa (Use empty responses for each prompt)
  Enter passphrase (empty for no passphrase): 
  Enter same passphrase again: 
  Your identification has been saved in ${NUTCH_INSTALL_DIR}/home/.ssh/id_rsa.
  Your public key has been saved in ${NUTCH_INSTALL_DIR}/home/.ssh/id_rsa.pub.
  The key fingerprint is:
  a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc [EMAIL PROTECTED]
</pre>

Copy the key for this machine to the authorized_keys file that will be copied 
to the other machines (the slaves).
<pre>
cd ${NUTCH_INSTALL_DIR}/home/.ssh
cp id_rsa.pub authorized_keys
</pre>

Edit the hadoop-site.xml configuration file.
<pre>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>fs.default.name</name>
  <value>???:9000</value>
  <description>
    The name of the default file system. Either the literal string 
    "local" or a host:port for NDFS.
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>???:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If 
    "local", then jobs are run in-process as a single map and 
    reduce task.
  </description>
</property>

<property> 
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>
    Define mapred.map.tasks to be the number of slave hosts.
  </description> 
</property> 

<property> 
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>
    Define mapred.reduce.tasks to be the number of slave hosts.
  </description> 
</property> 

<property>
  <name>dfs.name.dir</name>
  <value>${NUTCH_INSTALL_DIR}/filesystem/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>${NUTCH_INSTALL_DIR}/filesystem/data</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>${NUTCH_INSTALL_DIR}/filesystem/mapreduce/system</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>${NUTCH_INSTALL_DIR}/filesystem/mapreduce/local</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

</configuration>
</pre>

Edit the nutch-site.xml file
<pre>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>heeii</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>heeii.com</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.heeii.com</value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch at heeii.com</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
</configuration>
</pre>

Edit the crawl-urlfilter.txt file to set the pattern of the URLs that have to 
be fetched.
<pre>
cd ${NUTCH_INSTALL_DIR}/search
vi conf/crawl-urlfilter.txt

change the line that reads:   +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to read:                      +^http://([a-z0-9]*\.)*org/
</pre>

Or, if crawling the whole internet is desired, edit the nutch-site.xml file so 
that it includes the following property.
<pre>
<property>
  <name>urlfilter.regex.file</name>
  <value>automaton-urlfilter.txt</value>
</property>
</pre>


<h2>Distribute the code and the configuration</h2>
Copy the code and the configuration to the slaves
<pre>
scp -r ${NUTCH_INSTALL_DIR}/search/* [EMAIL PROTECTED]:${NUTCH_INSTALL_DIR}/search
</pre>

Copy the keys to the slave machines
<pre>
scp ${NUTCH_INSTALL_DIR}/home/.ssh/authorized_keys [EMAIL PROTECTED]:${NUTCH_INSTALL_DIR}/home/.ssh/authorized_keys
</pre>

Check if sshd is ready on the machines
<pre>
ssh ???
hostname
</pre>

<h2>Start Hadoop</h2>
Format the namenode (run this as the nutch user from the ${NUTCH_INSTALL_DIR}/search directory)
<pre>
bin/hadoop namenode -format
</pre>

Start all services on all machines.
<pre>
bin/start-all.sh
</pre>

To stop all of the servers you would use the following command:
<pre>
bin/stop-all.sh
</pre>

<h2>Crawling</h2>
To start crawling from a few URLs as seeds, make a urls directory containing a 
seed file with some seed URLs. This directory is put into HDFS; to check 
that HDFS has stored the directory, use the dfs -ls option of hadoop.
<pre>
mkdir urls
echo "http://lucene.apache.org" >> urls/seed
echo "http://nl.wikipedia.org" >> urls/seed
echo "http://en.wikipedia.org" >> urls/seed
bin/hadoop dfs -put urls urls
bin/hadoop dfs -ls urls
</pre>

Start to crawl
<pre>
bin/nutch crawl urls -dir crawled01 -depth 3
</pre>
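
The crawl command also accepts -threads and -topN options to control the number of 
fetcher threads and how many top-scoring pages are fetched per round. The values 
below are only illustrative:
<pre>
bin/nutch crawl urls -dir crawled01 -depth 3 -topN 1000 -threads 10
</pre>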

On the master node the progress and status can be viewed with a web browser.
[[http://localhost:50030/ http://localhost:50030/]]

<h2>Searching</h2>
To search in the collected web pages, the data that is now on the HDFS is best 
copied to the local filesystem for better performance. If an index becomes too 
large for one machine to handle, the index can be split and separate machines 
can each handle a part of it. First we try to perform a search on one machine.

Because searching needs different Nutch settings than crawling, the easiest 
thing to do is to make a separate folder for the Nutch search part.
<pre>
su
export SEARCH_INSTALL_DIR=/nutch-search-0.9.0
mkdir ${SEARCH_INSTALL_DIR}
chown nutch:users ${SEARCH_INSTALL_DIR}
exit
export SEARCH_INSTALL_DIR=/nutch-search-0.9.0
cp -Rv ${NUTCH_INSTALL_DIR}/search ${SEARCH_INSTALL_DIR}/search
mkdir ${SEARCH_INSTALL_DIR}/local
mkdir ${SEARCH_INSTALL_DIR}/home
</pre>

Copy the data 
<pre>
bin/hadoop dfs -copyToLocal crawled01 ${SEARCH_INSTALL_DIR}/local/
</pre>

Edit the nutch-site.xml in the nutch search directory
<pre>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>${SEARCH_INSTALL_DIR}/local/crawled01</value>
  </property>

</configuration>
</pre>

Edit the hadoop-site.xml file and delete all the properties, so that the search 
installation does not accidentally use HDFS.
<pre>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>
</pre>

Test if all is configured properly
<pre>
bin/nutch org.apache.nutch.searcher.NutchBean an
</pre>
The last command should give a number of hits. If the query results in 0 hits, 
there could be something wrong with the configuration or the index, or there 
simply are no documents containing the word. Try a few words; if all of them 
result in 0 hits, most probably the configuration is wrong or the index is 
corrupt. The configuration problems I came across were pointing to the wrong 
index directory and unintentionally using Hadoop.
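
A quick sanity check, assuming the data was copied as above, is to list the local 
crawl directory; a crawl produced by bin/nutch crawl normally contains crawldb, 
index, indexes, linkdb and segments subdirectories:
<pre>
ls ${SEARCH_INSTALL_DIR}/local/crawled01
</pre>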

Copy the war file to the tomcat directory
<pre>
rm -rf /usr/share/tomcat5/webapps/ROOT*
cp ${SEARCH_INSTALL_DIR}/search/*.war /usr/share/tomcat5/webapps/ROOT.war
</pre>

Copy the configuration to the tomcat directory
<pre>
cp ${SEARCH_INSTALL_DIR}/search/conf/* /usr/share/tomcat5/webapps/ROOT/WEB-INF/classes/
</pre>

Start tomcat 
<pre>
/usr/share/tomcat5/bin/startup.sh
</pre>

Open the search page in a web browser 
[[http://localhost:8180/ http://localhost:8180/]]

<h2>Distributed searching</h2>
Copy the search install directory to other machines.
<pre>
scp -r ${SEARCH_INSTALL_DIR}/search [EMAIL PROTECTED]:${SEARCH_INSTALL_DIR}/search
</pre>

Edit the nutch-site.xml so that the searcher.dir property points to a directory 
containing a search-servers.txt file with a list of IP addresses and ports.
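For example, a minimal sketch of the property (the value is an assumption for this 
setup; point it at whatever directory actually holds your search-servers.txt):
<pre>
<property>
  <name>searcher.dir</name>
  <value>${SEARCH_INSTALL_DIR}/local</value>
</property>
</pre>
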
Then edit the search-servers.txt file in that directory
<pre>
x.x.x.1 9999
x.x.x.2 9999
x.x.x.3 9999
</pre>

Start the search service on each machine listed in search-servers.txt
<pre>
bin/nutch server 9999 ${SEARCH_INSTALL_DIR}/local/crawled01
</pre>
