Author: siren Date: Thu Jun 8 13:03:11 2006 New Revision: 412847 URL: http://svn.apache.org/viewvc?rev=412847&view=rev Log: updated content from 0.7.2, added page about nightly builds, added hadoop as related project
Added: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml?rev=412847&r1=412846&r2=412847&view=diff ============================================================================== --- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml (original) +++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml Thu Jun 8 13:03:11 2006 @@ -15,6 +15,22 @@ <title>News</title> <section> + <title>31 March 2006: Nutch 0.7.2 Released</title> + <p>The 0.7.2 release of Nutch is now available. This is a bug fix release for 0.7 branch. See + <a href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158"> + CHANGES.txt</a> for details. The release is available + <a href="http://lucene.apache.org/nutch/release/">here</a>.</p> + </section> + + <section> + <title>1 October 2005: Nutch 0.7.1 Released</title> + <p>The 0.7.1 release of Nutch is now available. This is a bug fix release. See + <a href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986"> + CHANGES.txt</a> for details. The release is available + <a href="http://lucene.apache.org/nutch/release/">here</a>.</p> + </section> + + <section> <title>17 August 2005: Nutch 0.7 Released</title> <p>This is the first Nutch release as an Apache Lucene sub-project. See <a href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/CHANGES.txt?rev=233150"> Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml?rev=412847&r1=412846&r2=412847&view=diff ============================================================================== --- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml (original) +++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml Thu Jun 8 13:03:11 2006 @@ -11,7 +11,7 @@ <body> <p> Nutch issues (bugs, as well as enhancement requests) are tracked in - Apache JIRA <a href="http://nagoya.apache.org/jira/browse/Nutch">here</a>. + Apache JIRA <a href="http://issues.apache.org/jira/browse/Nutch">here</a>. If you aren't sure whether something is a bug, post a question on the Nutch user <a href="mailing_lists.html">mailing list</a>. 
</p> Added: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml?rev=412847&view=auto ============================================================================== --- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml (added) +++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml Thu Jun 8 13:03:11 2006 @@ -0,0 +1,28 @@ +<?xml version="1.0"?> + +<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> + +<document> + + <header> + <title>Nightly builds</title> + </header> + + <body> + <p> + Nightly binary builds contain the latest code available. Nightly + binary builds are provided for testing only. They might or might not + be functional. + </p> + <p> + You can track the progress of the 0.8-dev version from a <a href="http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:roadmap-panel">JIRA report</a>. + </p> + <p> + To report bugs, see <a href="issue_tracking.html">issue tracking</a>. + </p> + <p> + <a href="http://people.apache.org/dist/lucene/nutch/nightly/">Nutch nightly builds</a> (0.8-dev) + </p> + </body> + +</document> \ No newline at end of file Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml?rev=412847&r1=412846&r2=412847&view=diff ============================================================================== --- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml (original) +++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml Thu Jun 8 13:03:11 2006 @@ -26,7 +26,8 @@ <docs label="Documentation"> <faq label="FAQ" href="ext:faq" /> <wiki label="Wiki" href="ext:wiki" /> - <tutorial label="Tutorial" href="tutorial.html" /> + <tutorial label="Tutorial ver. 0.7" href="tutorial.html" /> + <tutorial8 label="Tutorial ver.
0.8" href="tutorial8.html" /> <webmasters label="Robot " href="bot.html" /> <i18n label="i18n" href="i18n.html" /> <apidocs label="API Docs" href="apidocs/index.html" /> @@ -34,17 +35,21 @@ <resources label="Resources"> <download label="Download" href="release/" /> + <nightly label="Nightly builds" href="nightly.html" /> <contact label="Mailing Lists" href="mailing_lists.html" /> <issues label="Issue Tracking" href="issue_tracking.html" /> <vcs label="Version Control" href="version_control.html" /> </resources> + <projects label="Related Projects"> <lucene label="Lucene Java" href="ext:lucene" /> + <hadoop label="Hadoop" href="ext:hadoop" /> </projects> <external-refs> <lucene href="http://lucene.apache.org/java/" /> + <hadoop href="http://lucene.apache.org/hadoop/" /> <wiki href="http://wiki.apache.org/nutch/" /> <faq href="http://wiki.apache.org/nutch/FAQ" /> <store href="http://www.cafepress.com/nutch/" /> Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml?rev=412847&r1=412846&r2=412847&view=diff ============================================================================== --- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml (original) +++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml Thu Jun 8 13:03:11 2006 @@ -6,7 +6,7 @@ <document> <header> - <title>Nutch tutorial</title> + <title>Nutch version 0.7 tutorial</title> </header> <body> @@ -66,11 +66,11 @@ <ol> -<li>Create a directory with a flat file of root urls. For example, to -crawl the <code>nutch</code> site you might start with a file named -<code>urls/nutch</code> containing the url of just the Nutch home -page. All other Nutch pages should be reachable from this page. The -<code>urls/nutch</code> file would thus contain: +<li>Create a flat file of root urls. For example, to crawl the +<code>nutch</code> site you might start with a file named +<code>urls</code> containing just the Nutch home page. All other +Nutch pages should be reachable from this page. The <code>urls</code> +file would thus look like: <source> http://lucene.apache.org/nutch/ </source> @@ -97,28 +97,24 @@ <ul> <li><code>-dir</code> <em>dir</em> names the directory to put the crawl in.</li> -<li><code>-threads</code> <em>threads</em> determines the number of -threads that will fetch in parallel.</li> <li><code>-depth</code> <em>depth</em> indicates the link depth from the root page that should be crawled.</li> -<li><code>-topN</code> <em>N</em> determines the maximum number of pages that -will be retrieved at each level up to the depth.</li> +<li><code>-delay</code> <em>delay</em> determines the number of seconds +between accesses to each host.</li> +<li><code>-threads</code> <em>threads</em> determines the number of +threads that will fetch in parallel.</li> </ul> <p>For example, a typical call might be:</p> <source> -bin/nutch crawl urls -dir crawl -depth 3 -topN 50 +bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log </source> -<p>Typically one starts testing one's configuration by crawling at -shallow depths, sharply limiting the number of pages fetched at each -level (<code>-topN</code>), and watching the output to check that -desired pages are fetched and undesirable pages are not. Once one is -confident of the configuration, then an appropriate depth for a full -crawl is around 10. 
The number of pages per level -(<code>-topN</code>) for a full crawl can be from tens of thousands to -millions, depending on your resources.</p> +<p>Typically one starts testing one's configuration by crawling at low +depths, and watching the output to check that desired pages are found. +Once one is more confident of the configuration, then an appropriate +depth for a full crawl is around 10.</p> <p>Once crawling has completed, one can skip to the Searching section below.</p> @@ -135,62 +131,54 @@ <section> <title>Whole-web: Concepts</title> -<p>Nutch data is composed of:</p> +<p>Nutch data is of two types:</p> <ol> - - <li>The crawl database, or <em>crawldb</em>. This contains -information about every url known to Nutch, including whether it was -fetched, and, if so, when.</li> - - <li>The link database, or <em>linkdb</em>. This contains the list -of known links to each url, including both the source url and anchor -text of the link.</li> - - <li>A set of <em>segments</em>. Each segment is a set of urls that are -fetched as a unit. Segments are directories with the following -subdirectories:</li> - + <li>The web database. This contains information about every +page known to Nutch, and about links between those pages.</li> + <li>A set of segments. Each segment is a set of pages that are +fetched and indexed as a unit. Segment data consists of the +following types:</li> <li><ul> - <li>a <em>crawl_generate</em> names a set of urls to be fetched</li> - <li>a <em>crawl_fetch</em> contains the status of fetching each url</li> - <li>a <em>content</em> contains the content of each url</li> - <li>a <em>parse_text</em> contains the parsed text of each url</li> - <li>a <em>parse_data</em> contains outlinks and metadata parsed - from each url</li> - <li>a <em>crawl_parse</em> contains the outlink urls, used to - update the crawldb</li> + <li>a <em>fetchlist</em> is a file +that names a set of pages to be fetched</li> + <li>the<em> fetcher output</em> is a +set of files containing the fetched pages</li> + <li>the <em>index </em>is a +Lucene-format index of the fetcher output.</li> </ul></li> - -<li>The <em>indexes</em>are Lucene-format indexes.</li> - </ol> +<p>In the following examples we will keep our web database in a directory +named <code>db</code> and our segments +in a directory named <code>segments</code>:</p> +<source>mkdir db +mkdir segments</source> </section> <section> <title>Whole-web: Boostrapping the Web Database</title> +<p>The admin tool is used to create a new, empty database:</p> + +<source>bin/nutch admin db -create</source> -<p>The <em>injector</em> adds urls to the crawldb. Let's inject URLs -from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we -must download and uncompress the file listing all of the DMOZ pages. -(This is a 200+Mb file, so this will take a few minutes.)</p> +<p>The <em>injector</em> adds urls into the database. Let's inject +URLs from the <a href="http://dmoz.org/">DMOZ</a> Open +Directory. First we must download and uncompress the file listing all +of the DMOZ pages. (This is a 200+Mb file, so this will take a few +minutes.)</p> <source>wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz gunzip content.rdf.u8.gz</source> -<p>Next we select a random subset of these pages. +<p>Next we inject a random subset of these pages into the web database. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million -URLs. We select one out of every 5000, so that we end up with +URLs. 
We inject one out of every 3000, so that we end up with around 1000 URLs:</p> -<source>mkdir dmoz -bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls</source> +<source>bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</source> -<p>The parser also takes a few minutes, as it must parse the full -file. Finally, we initialize the crawl db with the selected urls.</p> - -<source>bin/nutch inject crawl/crawldb dmoz</source> +<p>This also takes a few minutes, as it must parse the full file.</p> <p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p> @@ -198,39 +186,39 @@ <section> <title>Whole-web: Fetching</title> <p>To fetch, we first generate a fetchlist from the database:</p> -<source>bin/nutch generate crawl/crawldb crawl/segments +<source>bin/nutch generate db segments </source> <p>This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable <code>s1</code>:</p> -<source>s1=`ls -d crawl/segments/2* | tail -1` +<source>s1=`ls -d segments/2* | tail -1` echo $s1 </source> <p>Now we run the fetcher on this segment with:</p> <source>bin/nutch fetch $s1</source> <p>When this is complete, we update the database with the results of the fetch:</p> -<source>bin/nutch updatedb crawl/crawldb $s1</source> +<source>bin/nutch updatedb db $s1</source> <p>Now the database has entries for all of the pages referenced by the initial set.</p> <p>Now we fetch a new segment with the top-scoring 1000 pages:</p> -<source>bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -s2=`ls -d crawl/segments/2* | tail -1` +<source>bin/nutch generate db segments -topN 1000 +s2=`ls -d segments/2* | tail -1` echo $s2 bin/nutch fetch $s2 -bin/nutch updatedb crawl/crawldb $s2 +bin/nutch updatedb db $s2 </source> <p>Let's fetch one more round:</p> <source> -bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -s3=`ls -d crawl/segments/2* | tail -1` +bin/nutch generate db segments -topN 1000 +s3=`ls -d segments/2* | tail -1` echo $s3 bin/nutch fetch $s3 -bin/nutch updatedb crawl/crawldb $s3 +bin/nutch updatedb db $s3 </source> <p>By this point we've fetched a few thousand pages. Let's index @@ -239,20 +227,16 @@ </section> <section> <title>Whole-web: Indexing</title> +<p>To index each segment we use the <code>index</code> +command, as follows:</p> +<source>bin/nutch index $s1 +bin/nutch index $s2 +bin/nutch index $s3</source> -<p>Before indexing we first invert all of the links, so that we may -index incoming anchor text with the pages.</p> - -<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source> - -<p>To index the segments we use the <code>index</code> command, as follows:</p> - -<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source> - -<!-- <p>Then, before we can search a set of segments, we need to delete --> -<!-- duplicate pages. This is done with:</p> --> +<p>Then, before we can search a set of segments, we need to delete +duplicate pages. 
This is done with:</p> -<!-- <source>bin/nutch dedup indexes</source> --> +<source>bin/nutch dedup segments dedup.tmp</source> <p>Now we're ready to search!</p> @@ -272,8 +256,10 @@ cp nutch*.war ~/local/tomcat/webapps/ROOT.war </source> -<p>The webapp finds its indexes in <code>./crawl</code>, relative -to where you start Tomcat, so use a command like:</p> +<p>The webapp finds its indexes in <code>./segments</code>, relative +to where you start Tomcat, so, if you've done intranet crawling, +connect to your crawl directory, or, if you've done whole-web +crawling, don't change directories, and give the command:</p> <source>~/local/tomcat/bin/catalina.sh start </source> @@ -281,6 +267,8 @@ <p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a> and have fun!</p> +<p>More detailed tutorials are available on the Nutch Wiki. +</p> </section> </section> Added: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml?rev=412847&view=auto ============================================================================== --- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml (added) +++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml Thu Jun 8 13:03:11 2006 @@ -0,0 +1,291 @@ +<?xml version="1.0"?> + +<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" + "http://forrest.apache.org/dtd/document-v20.dtd"> + +<document> + +<header> + <title>Nutch version 0.8 tutorial</title> +</header> + +<body> + +<section> +<title>Requirements</title> +<ol> + <li>Java 1.4.x, either from <a + href="http://java.sun.com/j2se/downloads.html">Sun</a> or <a + href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a> on + Linux is preferred. Set <code>NUTCH_JAVA_HOME</code> to the root + of your JVM installation. + </li> + <li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a> +4.x.</li> + <li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for +shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)</li> + <li>Up to a gigabyte of free disk space, a high-speed connection, and +an hour or so. + </li> +</ol> +</section> +<section> +<title>Getting Started</title> + +<p>First, you need to get a copy of the Nutch code. You can download +a release from <a +href="http://lucene.apache.org/nutch/release/">http://lucene.apache.org/nutch/release/</a>. +Unpack the release and connect to its top-level directory. Or, check +out the latest source code from <a +href="version_control.html">subversion</a> and build it +with <a href="http://ant.apache.org/">Ant</a>.</p> + +<p>Try the following command:</p> +<source>bin/nutch</source> +<p>This will display the documentation for the Nutch command script.</p> + +<p>Now we're ready to crawl. 
There are two approaches to crawling:</p> +<ol> +<li>Intranet crawling, with the <code>crawl</code> command.</li> +<li>Whole-web crawling, with much greater control, using the lower +level <code>inject</code>, <code>generate</code>, <code>fetch</code> +and <code>updatedb</code> commands.</li> +</ol> + +</section> +<section> +<title>Intranet Crawling</title> + +<p>Intranet crawling is more appropriate when you intend to crawl up to +around one million pages on a handful of web servers.</p> + +<section> +<title>Intranet: Configuration</title> + +<p>To configure things for intranet crawling you must:</p> + +<ol> + +<li>Create a directory with a flat file of root urls. For example, to +crawl the <code>nutch</code> site you might start with a file named +<code>urls/nutch</code> containing the url of just the Nutch home +page. All other Nutch pages should be reachable from this page. The +<code>urls/nutch</code> file would thus contain: +<source> +http://lucene.apache.org/nutch/ +</source> +</li> + +<li>Edit the file <code>conf/crawl-urlfilter.txt</code> and replace +<code>MY.DOMAIN.NAME</code> with the name of the domain you wish to +crawl. For example, if you wished to limit the crawl to the +<code>apache.org</code> domain, the line should read: +<source> ++^http://([a-z0-9]*\.)*apache.org/ +</source> +This will include any url in the domain <code>apache.org</code>. +</li> + +</ol> + +</section> +<section> +<title>Intranet: Running the Crawl</title> + +<p>Once things are configured, running the crawl is easy. Just use the +crawl command. Its options include:</p> + +<ul> +<li><code>-dir</code> <em>dir</em> names the directory to put the crawl in.</li> +<li><code>-threads</code> <em>threads</em> determines the number of +threads that will fetch in parallel.</li> +<li><code>-depth</code> <em>depth</em> indicates the link depth from the root +page that should be crawled.</li> +<li><code>-topN</code> <em>N</em> determines the maximum number of pages that +will be retrieved at each level up to the depth.</li> +</ul> + +<p>For example, a typical call might be:</p> + +<source> +bin/nutch crawl urls -dir crawl -depth 3 -topN 50 +</source> + +<p>Typically one starts testing one's configuration by crawling at +shallow depths, sharply limiting the number of pages fetched at each +level (<code>-topN</code>), and watching the output to check that +desired pages are fetched and undesirable pages are not. Once one is +confident of the configuration, then an appropriate depth for a full +crawl is around 10. The number of pages per level +(<code>-topN</code>) for a full crawl can be from tens of thousands to +millions, depending on your resources.</p> + +<p>Once crawling has completed, one can skip to the Searching section +below.</p> + +</section> +</section> + +<section> +<title>Whole-web Crawling</title> + +<p>Whole-web crawling is designed to handle very large crawls which may +take weeks to complete, running on multiple machines.</p> + +<section> +<title>Whole-web: Concepts</title> + +<p>Nutch data is composed of:</p> + +<ol> + + <li>The crawl database, or <em>crawldb</em>. This contains +information about every url known to Nutch, including whether it was +fetched, and, if so, when.</li> + + <li>The link database, or <em>linkdb</em>. This contains the list +of known links to each url, including both the source url and anchor +text of the link.</li> + + <li>A set of <em>segments</em>. Each segment is a set of urls that are +fetched as a unit. 
Segments are directories with the following +subdirectories:</li> + + <li><ul> + <li>a <em>crawl_generate</em> names a set of urls to be fetched</li> + <li>a <em>crawl_fetch</em> contains the status of fetching each url</li> + <li>a <em>content</em> contains the content of each url</li> + <li>a <em>parse_text</em> contains the parsed text of each url</li> + <li>a <em>parse_data</em> contains outlinks and metadata parsed + from each url</li> + <li>a <em>crawl_parse</em> contains the outlink urls, used to + update the crawldb</li> + </ul></li> + +<li>The <em>indexes</em> are Lucene-format indexes.</li> + +</ol> + +</section> +<section> +<title>Whole-web: Bootstrapping the Web Database</title> + +<p>The <em>injector</em> adds urls to the crawldb. Let's inject URLs +from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we +must download and uncompress the file listing all of the DMOZ pages. +(This is a 200+Mb file, so this will take a few minutes.)</p> + +<source>wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz +gunzip content.rdf.u8.gz</source> + +<p>Next we select a random subset of these pages. + (We use a random subset so that everyone who runs this tutorial +doesn't hammer the same sites.) DMOZ contains around three million +URLs. We select one out of every 5000, so that we end up with +around 1000 URLs:</p> + +<source>mkdir dmoz +bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls</source> + +<p>The parser also takes a few minutes, as it must parse the full +file. Finally, we initialize the crawl db with the selected urls.</p> + +<source>bin/nutch inject crawl/crawldb dmoz</source> + +<p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p> + +</section> +<section> +<title>Whole-web: Fetching</title> +<p>To fetch, we first generate a fetchlist from the database:</p> +<source>bin/nutch generate crawl/crawldb crawl/segments +</source> +<p>This generates a fetchlist for all of the pages due to be fetched. + The fetchlist is placed in a newly created segment directory. + The segment directory is named by the time it's created. We +save the name of this segment in the shell variable <code>s1</code>:</p> +<source>s1=`ls -d crawl/segments/2* | tail -1` +echo $s1 +</source> +<p>Now we run the fetcher on this segment with:</p> +<source>bin/nutch fetch $s1</source> +<p>When this is complete, we update the database with the results of the +fetch:</p> +<source>bin/nutch updatedb crawl/crawldb $s1</source> +<p>Now the database has entries for all of the pages referenced by the +initial set.</p> + +<p>Now we fetch a new segment with the top-scoring 1000 pages:</p> +<source>bin/nutch generate crawl/crawldb crawl/segments -topN 1000 +s2=`ls -d crawl/segments/2* | tail -1` +echo $s2 + +bin/nutch fetch $s2 +bin/nutch updatedb crawl/crawldb $s2 +</source> +<p>Let's fetch one more round:</p> +<source> +bin/nutch generate crawl/crawldb crawl/segments -topN 1000 +s3=`ls -d crawl/segments/2* | tail -1` +echo $s3 + +bin/nutch fetch $s3 +bin/nutch updatedb crawl/crawldb $s3 +</source> + +<p>By this point we've fetched a few thousand pages.
Let's index +them!</p> + +</section> +<section> +<title>Whole-web: Indexing</title> + +<p>Before indexing we first invert all of the links, so that we may +index incoming anchor text with the pages.</p> + +<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source> + +<p>To index the segments we use the <code>index</code> command, as follows:</p> + +<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source> + +<!-- <p>Then, before we can search a set of segments, we need to delete --> +<!-- duplicate pages. This is done with:</p> --> + +<!-- <source>bin/nutch dedup indexes</source> --> + +<p>Now we're ready to search!</p> + +</section> +<section> +<title>Searching</title> + +<p>To search you need to put the nutch war file into your servlet +container. (If instead of downloading a Nutch release you checked the +sources out of SVN, then you'll first need to build the war file, with +the command <code>ant war</code>.)</p> + +<p>Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war +file may be installed with the commands:</p> + +<source>rm -rf ~/local/tomcat/webapps/ROOT* +cp nutch*.war ~/local/tomcat/webapps/ROOT.war +</source> + +<p>The webapp finds its indexes in <code>./crawl</code>, relative +to where you start Tomcat, so use a command like:</p> + +<source>~/local/tomcat/bin/catalina.sh start +</source> + +<p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a> +and have fun!</p> + +<p>More detailed tutorials are available on the Nutch Wiki. +</p> + +</section> +</section> + +</body> +</document>
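
For quick reference, the whole-web sequence from the new 0.8 tutorial above can be run as one shell session. This is only a sketch assembled from the commands quoted in the diff: the for-loop and the <code>$s</code> variable are editorial shorthand (the tutorial spells out three separate generate/fetch/updatedb rounds), and every path and value simply repeats the tutorial's examples.

<source>
# Bootstrap the crawldb from a DMOZ subset (values as in the 0.8 tutorial)
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
bin/nutch inject crawl/crawldb dmoz

# Three fetch rounds: the first takes all pages due, the next two the top-scoring 1000
for topn in "" "-topN 1000" "-topN 1000"; do
  bin/nutch generate crawl/crawldb crawl/segments $topn
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done

# Invert links, then index all segments
bin/nutch invertlinks crawl/linkdb crawl/segments
bin/nutch index indexes crawl/linkdb crawl/segments/*
</source>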