Author: siren
Date: Thu Jun  8 13:03:11 2006
New Revision: 412847

URL: http://svn.apache.org/viewvc?rev=412847&view=rev
Log:
updated content from 0.7.2, added page about nightly builds, added hadoop as 
related project

Added:
    lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml
    lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml
Modified:
    lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml
    
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml
    lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml
    lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml

Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml?rev=412847&r1=412846&r2=412847&view=diff
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml 
(original)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml Thu 
Jun  8 13:03:11 2006
@@ -15,6 +15,22 @@
       <title>News</title>
 
       <section>
+      <title>31 March 2006: Nutch 0.7.2 Released</title>
+      <p>The 0.7.2 release of Nutch is now available. This is a bug fix 
release for the 0.7 branch. See
+      <a 
href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158">
+      CHANGES.txt</a> for details. The release is available
+      <a href="http://lucene.apache.org/nutch/release/">here</a>.</p>
+      </section>
+   
+      <section>
+      <title>1 October 2005: Nutch 0.7.1 Released</title>
+      <p>The 0.7.1 release of Nutch is now available. This is a bug fix 
release. See
+      <a 
href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986";>
+      CHANGES.txt</a> for details. The release is available
+      <a href="http://lucene.apache.org/nutch/release/";>here</a>.</p>
+      </section>
+
+      <section>
       <title>17 August 2005: Nutch 0.7 Released</title>
       <p>This is the first Nutch release as an Apache Lucene sub-project. See 
       <a 
href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/CHANGES.txt?rev=233150";>

Modified: 
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml?rev=412847&r1=412846&r2=412847&view=diff
==============================================================================
--- 
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml 
(original)
+++ 
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml 
Thu Jun  8 13:03:11 2006
@@ -11,7 +11,7 @@
   <body>
     <p>
       Nutch issues (bugs, as well as enhancement requests) are tracked in 
-      Apache JIRA <a 
href="http://nagoya.apache.org/jira/browse/Nutch";>here</a>.
+      Apache JIRA <a 
href="http://issues.apache.org/jira/browse/Nutch";>here</a>.
       If you aren't sure whether something is a bug, post a question on the
       Nutch user <a href="mailing_lists.html">mailing list</a>.
     </p>

Added: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml?rev=412847&view=auto
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml 
(added)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml Thu 
Jun  8 13:03:11 2006
@@ -0,0 +1,28 @@
+<?xml version="1.0"?>
+
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" 
"http://forrest.apache.org/dtd/document-v20.dtd";>
+
+<document>
+  
+  <header>
+    <title>Nightly builds</title>
+  </header>
+  
+  <body>
+    <p>
+    Nightly binary builds contain the latest code available. They are
+    provided for testing only and may or may not be functional.
+    </p>
+    <p>
+    You can track the progress of the 0.8-dev version in this <a 
href="http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:roadmap-panel">Jira
 report</a>.
+    </p>
+    <p>
+    To report bugs, see <a href="issue_tracking.html">issue tracking</a>.
+    </p>
+    <p>
+    <a href="http://people.apache.org/dist/lucene/nutch/nightly/";>Nutch 
nightly builds</a> (0.8-dev)
+    </p>
+  </body>
+  
+</document>
\ No newline at end of file

Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml?rev=412847&r1=412846&r2=412847&view=diff
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml 
(original)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml Thu 
Jun  8 13:03:11 2006
@@ -26,7 +26,8 @@
   <docs label="Documentation">    
     <faq         label="FAQ"              href="ext:faq" />    
     <wiki        label="Wiki"             href="ext:wiki" />    
-    <tutorial    label="Tutorial"         href="tutorial.html" />
+    <tutorial    label="Tutorial ver. 0.7"     href="tutorial.html" />
+    <tutorial8   label="Tutorial ver. 0.8"     href="tutorial8.html" />
     <webmasters  label="Robot     "       href="bot.html" />
     <i18n        label="i18n"             href="i18n.html" />
     <apidocs     label="API Docs"         href="apidocs/index.html" />
@@ -34,17 +35,21 @@
 
   <resources label="Resources">
     <download    label="Download"         href="release/" />
+    <nightly     label="Nightly builds"   href="nightly.html" />
     <contact     label="Mailing Lists"    href="mailing_lists.html" />
     <issues      label="Issue Tracking"   href="issue_tracking.html" />
     <vcs         label="Version Control"  href="version_control.html" />
   </resources>
 
+
   <projects label="Related Projects">
     <lucene     label="Lucene Java"      href="ext:lucene" />
+    <hadoop     label="Hadoop"           href="ext:hadoop" />
   </projects>
 
   <external-refs>
     <lucene    href="http://lucene.apache.org/java/" />
+    <hadoop    href="http://lucene.apache.org/hadoop/" />
     <wiki      href="http://wiki.apache.org/nutch/" />
     <faq       href="http://wiki.apache.org/nutch/FAQ" />
     <store     href="http://www.cafepress.com/nutch/" />

Modified: 
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml?rev=412847&r1=412846&r2=412847&view=diff
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml 
(original)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml 
Thu Jun  8 13:03:11 2006
@@ -6,7 +6,7 @@
 <document>
 
 <header>
-  <title>Nutch tutorial</title> 
+  <title>Nutch version 0.7 tutorial</title> 
 </header> 
 
 <body>
@@ -66,11 +66,11 @@
 
 <ol>
 
-<li>Create a directory with a flat file of root urls.  For example, to
-crawl the <code>nutch</code> site you might start with a file named
-<code>urls/nutch</code> containing the url of just the Nutch home
-page.  All other Nutch pages should be reachable from this page.  The
-<code>urls/nutch</code> file would thus contain:
+<li>Create a flat file of root urls.  For example, to crawl the
+<code>nutch</code> site you might start with a file named
+<code>urls</code> containing just the Nutch home page.  All other
+Nutch pages should be reachable from this page.  The <code>urls</code>
+file would thus look like:
 <source>
 http://lucene.apache.org/nutch/
 </source>
@@ -97,28 +97,24 @@
 
 <ul>
 <li><code>-dir</code> <em>dir</em> names the directory to put the crawl 
in.</li>
-<li><code>-threads</code> <em>threads</em> determines the number of
-threads that will fetch in parallel.</li>
 <li><code>-depth</code> <em>depth</em> indicates the link depth from the root
 page that should be crawled.</li>
-<li><code>-topN</code> <em>N</em> determines the maximum number of pages that
-will be retrieved at each level up to the depth.</li>
+<li><code>-delay</code> <em>delay</em> determines the number of seconds
+between accesses to each host.</li>
+<li><code>-threads</code> <em>threads</em> determines the number of
+threads that will fetch in parallel.</li>
 </ul>
 
 <p>For example, a typical call might be:</p>
 
 <source>
-bin/nutch crawl urls -dir crawl -depth 3 -topN 50
+bin/nutch crawl urls -dir crawl.test -depth 3 >&amp; crawl.log
 </source>
 
-<p>Typically one starts testing one's configuration by crawling at
-shallow depths, sharply limiting the number of pages fetched at each
-level (<code>-topN</code>), and watching the output to check that
-desired pages are fetched and undesirable pages are not.  Once one is
-confident of the configuration, then an appropriate depth for a full
-crawl is around 10.  The number of pages per level
-(<code>-topN</code>) for a full crawl can be from tens of thousands to
-millions, depending on your resources.</p>
+<p>Typically one starts testing one's configuration by crawling at low
+depths and watching the output to check that the desired pages are found.
+Once the configuration looks right, an appropriate depth for a full
+crawl is around 10.</p>
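+
+<p>A first test run, following the same form as the call above but at a
+lower depth (the directory name is only an example), might be:</p>
+
+<source>
+bin/nutch crawl urls -dir crawl.test -depth 2 >&amp; crawl.log
+</source>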
 
 <p>Once crawling has completed, one can skip to the Searching section
 below.</p>
@@ -135,62 +131,54 @@
 <section>
 <title>Whole-web: Concepts</title>
 
-<p>Nutch data is composed of:</p>
+<p>Nutch data is of two types:</p>
 
 <ol>
-
-  <li>The crawl database, or <em>crawldb</em>.  This contains
-information about every url known to Nutch, including whether it was
-fetched, and, if so, when.</li>
-
-  <li>The link database, or <em>linkdb</em>.  This contains the list
-of known links to each url, including both the source url and anchor
-text of the link.</li>
-
-  <li>A set of <em>segments</em>.  Each segment is a set of urls that are
-fetched as a unit.  Segments are directories with the following
-subdirectories:</li>
-
+  <li>The web database.  This contains information about every
+page known to Nutch, and about links between those pages.</li>
+  <li>A set of segments.  Each segment is a set of pages that are
+fetched and indexed as a unit.  Segment data consists of the
+following types:</li>
   <li><ul>
-    <li>a <em>crawl_generate</em> names a set of urls to be fetched</li>
-    <li>a <em>crawl_fetch</em> contains the status of fetching each url</li>
-    <li>a <em>content</em> contains the content of each url</li>
-    <li>a <em>parse_text</em> contains the parsed text of each url</li>
-    <li>a <em>parse_data</em> contains outlinks and metadata parsed
-    from each url</li>
-    <li>a <em>crawl_parse</em> contains the outlink urls, used to
-    update the crawldb</li>
+    <li>a <em>fetchlist</em> is a file
+that names a set of pages to be fetched</li>
+    <li>the <em>fetcher output</em> is a
+set of files containing the fetched pages</li>
+    <li>the <em>index</em> is a
+Lucene-format index of the fetcher output.</li>
   </ul></li>
-
-<li>The <em>indexes</em>are Lucene-format indexes.</li>
-
 </ol>
+<p>In the following examples we will keep our web database in a directory
+named <code>db</code> and our segments
+in a directory named <code>segments</code>:</p>
+<source>mkdir db
+mkdir segments</source>
 
 </section>
 <section>
 <title>Whole-web: Bootstrapping the Web Database</title>
+<p>The admin tool is used to create a new, empty database:</p>
+
+<source>bin/nutch admin db -create</source>
 
-<p>The <em>injector</em> adds urls to the crawldb.  Let's inject URLs
-from the <a href="http://dmoz.org/";>DMOZ</a> Open Directory. First we
-must download and uncompress the file listing all of the DMOZ pages.
-(This is a 200+Mb file, so this will take a few minutes.)</p>
+<p>The <em>injector</em> adds urls into the database.  Let's inject
+URLs from the <a href="http://dmoz.org/";>DMOZ</a> Open
+Directory. First we must download and uncompress the file listing all
+of the DMOZ pages.  (This is a 200+Mb file, so this will take a few
+minutes.)</p>
 
 <source>wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
 gunzip content.rdf.u8.gz</source>
 
-<p>Next we select a random subset of these pages.
+<p>Next we inject a random subset of these pages into the web database.
  (We use a random subset so that everyone who runs this tutorial
 doesn't hammer the same sites.)  DMOZ contains around three million
-URLs.  We select one out of every 5000, so that we end up with
+URLs.  We inject one out of every 3000, so that we end up with
 around 1000 URLs:</p>
 
-<source>mkdir dmoz
-bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 &gt; 
dmoz/urls</source>
+<source>bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</source>
 
-<p>The parser also takes a few minutes, as it must parse the full
-file.  Finally, we initialize the crawl db with the selected urls.</p>
-
-<source>bin/nutch inject crawl/crawldb dmoz</source>
+<p>This also takes a few minutes, as it must parse the full file.</p>
 
 <p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>
 
@@ -198,39 +186,39 @@
 <section>
 <title>Whole-web: Fetching</title>
 <p>To fetch, we first generate a fetchlist from the database:</p>
-<source>bin/nutch generate crawl/crawldb crawl/segments
+<source>bin/nutch generate db segments
 </source>
 <p>This generates a fetchlist for all of the pages due to be fetched.
  The fetchlist is placed in a newly created segment directory.
  The segment directory is named by the time it's created.  We
 save the name of this segment in the shell variable <code>s1</code>:</p>
-<source>s1=`ls -d crawl/segments/2* | tail -1`
+<source>s1=`ls -d segments/2* | tail -1`
 echo $s1
 </source>
 <p>Now we run the fetcher on this segment with:</p>
 <source>bin/nutch fetch $s1</source>
 <p>When this is complete, we update the database with the results of the
 fetch:</p>
-<source>bin/nutch updatedb crawl/crawldb $s1</source>
+<source>bin/nutch updatedb db $s1</source>
 <p>Now the database has entries for all of the pages referenced by the
 initial set.</p>
 
 <p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
-<source>bin/nutch generate crawl/crawldb crawl/segments -topN 1000
-s2=`ls -d crawl/segments/2* | tail -1`
+<source>bin/nutch generate db segments -topN 1000
+s2=`ls -d segments/2* | tail -1`
 echo $s2
 
 bin/nutch fetch $s2
-bin/nutch updatedb crawl/crawldb $s2
+bin/nutch updatedb db $s2
 </source>
 <p>Let's fetch one more round:</p>
 <source>
-bin/nutch generate crawl/crawldb crawl/segments -topN 1000
-s3=`ls -d crawl/segments/2* | tail -1`
+bin/nutch generate db segments -topN 1000
+s3=`ls -d segments/2* | tail -1`
 echo $s3
 
 bin/nutch fetch $s3
-bin/nutch updatedb crawl/crawldb $s3
+bin/nutch updatedb db $s3
 </source>
 
 <p>By this point we've fetched a few thousand pages.  Let's index
@@ -239,20 +227,16 @@
 </section>
 <section>
 <title>Whole-web: Indexing</title>
+<p>To index each segment we use the <code>index</code>
+command, as follows:</p>
+<source>bin/nutch index $s1
+bin/nutch index $s2
+bin/nutch index $s3</source>
 
-<p>Before indexing we first invert all of the links, so that we may
-index incoming anchor text with the pages.</p>
-
-<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source>
-
-<p>To index the segments we use the <code>index</code> command, as follows:</p>
-
-<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source>
-
-<!-- <p>Then, before we can search a set of segments, we need to delete -->
-<!-- duplicate pages.  This is done with:</p> -->
+<p>Then, before we can search a set of segments, we need to delete
+duplicate pages.  This is done with:</p>
 
-<!-- <source>bin/nutch dedup indexes</source> -->
+<source>bin/nutch dedup segments dedup.tmp</source>
 
 <p>Now we're ready to search!</p>
 
@@ -272,8 +256,10 @@
 cp nutch*.war ~/local/tomcat/webapps/ROOT.war
 </source>
 
-<p>The webapp finds its indexes in <code>./crawl</code>, relative
-to where you start Tomcat, so use a command like:</p>
+<p>The webapp finds its indexes in <code>./segments</code>, relative
+to where you start Tomcat.  If you've done intranet crawling, change
+to your crawl directory first; if you've done whole-web crawling, stay
+where you are.  Then give the command:</p>
 
 <source>~/local/tomcat/bin/catalina.sh start
 </source>
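+
+<p>For example, if you used the intranet crawl directory
+<code>crawl.test</code> from earlier in this tutorial, the sequence
+might look like this (the Tomcat path is just the one assumed above):</p>
+
+<source>cd crawl.test
+~/local/tomcat/bin/catalina.sh start
+</source>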
@@ -281,6 +267,8 @@
 <p>Then visit <a href="http://localhost:8080/";>http://localhost:8080/</a>
 and have fun!</p>
 
+<p>More detailed tutorials are available on the Nutch Wiki.
+</p>
 </section>
 </section>
 

Added: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml?rev=412847&view=auto
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml 
(added)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml 
Thu Jun  8 13:03:11 2006
@@ -0,0 +1,291 @@
+<?xml version="1.0"?>
+
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" 
+          "http://forrest.apache.org/dtd/document-v20.dtd";>
+
+<document>
+
+<header>
+  <title>Nutch version 0.8 tutorial</title> 
+</header> 
+
+<body>
+
+<section>
+<title>Requirements</title>
+<ol>
+  <li>Java 1.4.x, either from <a
+ href="http://java.sun.com/j2se/downloads.html";>Sun</a> or <a
+ href="http://www-106.ibm.com/developerworks/java/jdk/";>IBM</a> on
+ Linux is preferred.  Set <code>NUTCH_JAVA_HOME</code> to the root
+ of your JVM installation.
+  </li>
+  <li>Apache's <a href="http://jakarta.apache.org/tomcat/";>Tomcat</a>
+4.x.</li>
+  <li>On Win32, <a href="http://www.cygwin.com/";>cygwin</a>, for
+shell support.  (If you plan to use Subversion on Win32, be sure to select the 
subversion package when you install, in the "Devel" category.)</li>
+  <li>Up to a gigabyte of free disk space, a high-speed connection, and
+an hour or so.
+  </li>
+</ol>
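+
+<p>Setting <code>NUTCH_JAVA_HOME</code> might, for example, look like
+the following (the path below is only an illustration; point it at
+wherever your JVM is actually installed):</p>
+
+<source>export NUTCH_JAVA_HOME=/usr/java/j2sdk1.4.2</source>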
+</section>
+<section>
+<title>Getting Started</title>
+
+<p>First, you need to get a copy of the Nutch code.  You can download
+a release from <a
+href="http://lucene.apache.org/nutch/release/";>http://lucene.apache.org/nutch/release/</a>.
+Unpack the release and connect to its top-level directory.  Or, check
+out the latest source code from <a
+href="version_control.html">subversion</a> and build it
+with <a href="http://ant.apache.org/">Ant</a>.</p>
+
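+<p>If you check the source out instead of using a release, the build
+might look roughly like this (the trunk URL here is an assumption;
+see the <a href="version_control.html">version control</a> page for
+the authoritative location):</p>
+
+<source>svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
+cd nutch
+ant
+</source>
+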
+<p>Try the following command:</p>
+<source>bin/nutch</source>
+<p>This will display the documentation for the Nutch command script.</p>
+
+<p>Now we're ready to crawl.  There are two approaches to crawling:</p>
+<ol>
+<li>Intranet crawling, with the <code>crawl</code> command.</li>
+<li>Whole-web crawling, with much greater control, using the lower
+level <code>inject</code>, <code>generate</code>, <code>fetch</code>
+and <code>updatedb</code> commands.</li>
+</ol>
+
+</section>
+<section>
+<title>Intranet Crawling</title>
+
+<p>Intranet crawling is more appropriate when you intend to crawl up to
+around one million pages on a handful of web servers.</p>
+
+<section>
+<title>Intranet: Configuration</title>
+
+<p>To configure things for intranet crawling you must:</p>
+
+<ol>
+
+<li>Create a directory with a flat file of root urls.  For example, to
+crawl the <code>nutch</code> site you might start with a file named
+<code>urls/nutch</code> containing the url of just the Nutch home
+page.  All other Nutch pages should be reachable from this page.  The
+<code>urls/nutch</code> file would thus contain:
+<source>
+http://lucene.apache.org/nutch/
+</source>
+</li>
+
+<li>Edit the file <code>conf/crawl-urlfilter.txt</code> and replace
+<code>MY.DOMAIN.NAME</code> with the name of the domain you wish to
+crawl.  For example, if you wished to limit the crawl to the
+<code>apache.org</code> domain, the line should read:
+<source>
++^http://([a-z0-9]*\.)*apache.org/
+</source>
+This will include any url in the domain <code>apache.org</code>.
+</li>
+
+</ol>
+
+</section>
+<section>
+<title>Intranet: Running the Crawl</title>
+
+<p>Once things are configured, running the crawl is easy.  Just use the
+crawl command.  Its options include:</p>
+
+<ul>
+<li><code>-dir</code> <em>dir</em> names the directory to put the crawl 
in.</li>
+<li><code>-threads</code> <em>threads</em> determines the number of
+threads that will fetch in parallel.</li>
+<li><code>-depth</code> <em>depth</em> indicates the link depth from the root
+page that should be crawled.</li>
+<li><code>-topN</code> <em>N</em> determines the maximum number of pages that
+will be retrieved at each level up to the depth.</li>
+</ul>
+
+<p>For example, a typical call might be:</p>
+
+<source>
+bin/nutch crawl urls -dir crawl -depth 3 -topN 50
+</source>
+
+<p>Typically one starts testing one's configuration by crawling at
+shallow depths, sharply limiting the number of pages fetched at each
+level (<code>-topN</code>), and watching the output to check that
+desired pages are fetched and undesirable pages are not.  Once one is
+confident of the configuration, then an appropriate depth for a full
+crawl is around 10.  The number of pages per level
+(<code>-topN</code>) for a full crawl can be from tens of thousands to
+millions, depending on your resources.</p>
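+
+<p>A concrete first test run, using the options described above
+(directory name and limits chosen only for illustration), might be:</p>
+
+<source>
+bin/nutch crawl urls -dir crawl.test -depth 2 -topN 10
+</source>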
+
+<p>Once crawling has completed, one can skip to the Searching section
+below.</p>
+
+</section>
+</section>
+
+<section>
+<title>Whole-web Crawling</title>
+
+<p>Whole-web crawling is designed to handle very large crawls which may
+take weeks to complete, running on multiple machines.</p>
+
+<section>
+<title>Whole-web: Concepts</title>
+
+<p>Nutch data is composed of:</p>
+
+<ol>
+
+  <li>The crawl database, or <em>crawldb</em>.  This contains
+information about every url known to Nutch, including whether it was
+fetched, and, if so, when.</li>
+
+  <li>The link database, or <em>linkdb</em>.  This contains the list
+of known links to each url, including both the source url and anchor
+text of the link.</li>
+
+  <li>A set of <em>segments</em>.  Each segment is a set of urls that are
+fetched as a unit.  Segments are directories with the following
+subdirectories:</li>
+
+  <li><ul>
+    <li>a <em>crawl_generate</em> names a set of urls to be fetched</li>
+    <li>a <em>crawl_fetch</em> contains the status of fetching each url</li>
+    <li>a <em>content</em> contains the content of each url</li>
+    <li>a <em>parse_text</em> contains the parsed text of each url</li>
+    <li>a <em>parse_data</em> contains outlinks and metadata parsed
+    from each url</li>
+    <li>a <em>crawl_parse</em> contains the outlink urls, used to
+    update the crawldb</li>
+  </ul></li>
+
+<li>The <em>indexes</em> are Lucene-format indexes.</li>
+
+</ol>
+
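+<p>As a rough illustration (the timestamped segment name below is
+invented; real segment names are derived from their creation time), the
+layout produced by the commands later in this tutorial looks roughly like:</p>
+
+<source>crawl/crawldb/
+crawl/linkdb/
+crawl/segments/20060608120000/crawl_generate/
+crawl/segments/20060608120000/crawl_fetch/
+crawl/segments/20060608120000/content/
+crawl/segments/20060608120000/parse_text/
+crawl/segments/20060608120000/parse_data/
+crawl/segments/20060608120000/crawl_parse/
+indexes/
+</source>
+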
+</section>
+<section>
+<title>Whole-web: Bootstrapping the Web Database</title>
+
+<p>The <em>injector</em> adds urls to the crawldb.  Let's inject URLs
+from the <a href="http://dmoz.org/";>DMOZ</a> Open Directory. First we
+must download and uncompress the file listing all of the DMOZ pages.
+(This is a 200+Mb file, so this will take a few minutes.)</p>
+
+<source>wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
+gunzip content.rdf.u8.gz</source>
+
+<p>Next we select a random subset of these pages.
+ (We use a random subset so that everyone who runs this tutorial
+doesn't hammer the same sites.)  DMOZ contains around three million
+URLs.  We select one out of every 5000, so that we end up with
+around 1000 URLs:</p>
+
+<source>mkdir dmoz
+bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 &gt; 
dmoz/urls</source>
+
+<p>The parser also takes a few minutes, as it must parse the full
+file.  Finally, we initialize the crawl db with the selected urls.</p>
+
+<source>bin/nutch inject crawl/crawldb dmoz</source>
+
+<p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>
+
+</section>
+<section>
+<title>Whole-web: Fetching</title>
+<p>To fetch, we first generate a fetchlist from the database:</p>
+<source>bin/nutch generate crawl/crawldb crawl/segments
+</source>
+<p>This generates a fetchlist for all of the pages due to be fetched.
+ The fetchlist is placed in a newly created segment directory.
+ The segment directory is named by the time it's created.  We
+save the name of this segment in the shell variable <code>s1</code>:</p>
+<source>s1=`ls -d crawl/segments/2* | tail -1`
+echo $s1
+</source>
+<p>Now we run the fetcher on this segment with:</p>
+<source>bin/nutch fetch $s1</source>
+<p>When this is complete, we update the database with the results of the
+fetch:</p>
+<source>bin/nutch updatedb crawl/crawldb $s1</source>
+<p>Now the database has entries for all of the pages referenced by the
+initial set.</p>
+
+<p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
+<source>bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+s2=`ls -d crawl/segments/2* | tail -1`
+echo $s2
+
+bin/nutch fetch $s2
+bin/nutch updatedb crawl/crawldb $s2
+</source>
+<p>Let's fetch one more round:</p>
+<source>
+bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+s3=`ls -d crawl/segments/2* | tail -1`
+echo $s3
+
+bin/nutch fetch $s3
+bin/nutch updatedb crawl/crawldb $s3
+</source>
+
+<p>By this point we've fetched a few thousand pages.  Let's index
+them!</p>
+
+</section>
+<section>
+<title>Whole-web: Indexing</title>
+
+<p>Before indexing we first invert all of the links, so that we may
+index incoming anchor text with the pages.</p>
+
+<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source>
+
+<p>To index the segments we use the <code>index</code> command, as follows:</p>
+
+<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source>
+
+<!-- <p>Then, before we can search a set of segments, we need to delete -->
+<!-- duplicate pages.  This is done with:</p> -->
+
+<!-- <source>bin/nutch dedup indexes</source> -->
+
+<p>Now we're ready to search!</p>
+
+</section>
+<section>
+<title>Searching</title>
+
+<p>To search you need to put the nutch war file into your servlet
+container.  (If instead of downloading a Nutch release you checked the
+sources out of SVN, then you'll first need to build the war file, with
+the command <code>ant war</code>.)</p>
+
+<p>Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war
+file may be installed with the commands:</p>
+
+<source>rm -rf ~/local/tomcat/webapps/ROOT*
+cp nutch*.war ~/local/tomcat/webapps/ROOT.war
+</source>
+
+<p>The webapp finds its indexes in <code>./crawl</code>, relative
+to where you start Tomcat, so use a command like:</p>
+
+<source>~/local/tomcat/bin/catalina.sh start
+</source>
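+
+<p>For instance, if your crawl data ended up in <code>~/nutch/crawl</code>
+(a hypothetical location), you would start Tomcat from
+<code>~/nutch</code>:</p>
+
+<source>cd ~/nutch
+~/local/tomcat/bin/catalina.sh start
+</source>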
+
+<p>Then visit <a href="http://localhost:8080/";>http://localhost:8080/</a>
+and have fun!</p>
+
+<p>More detailed tutorials are available on the Nutch Wiki.
+</p>
+
+</section>
+</section>
+
+</body>
+</document>

