Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by riverma: https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=65&rev2=66 Comment: Added requirements so that new users understand what software is needed to run or build Nutch. <<TableOfContents(3)>> == Steps == - {{{#!wiki caution This tutorial describes the installation and use of Nutch 1.x (current release is 1.7). How to compile and set up Nutch 2.x with HBase, see Nutch2Tutorial. }}} + == Requirements == + * Unix environment, or Windows-[[https://www.cygwin.com/|Cygwin]] environment + * Java Runtime/Development Environment (1.5+): http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html + * (Source build only) Apache Ant: http://ant.apache.org/ == 1. Setup Nutch from binary distribution == * Download a binary package (`apache-nutch-1.X-bin.zip`) from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. @@ -27, +30 @@ === Set up from the source distribution === Advanced users may also use the source distribution: + * Download a source package (`apache-nutch-1.X-src.zip`) * Unzip * `cd apache-nutch-1.X/` @@ -34, +38 @@ * Now there is a directory `runtime/local` which contains a ready to use Nutch installation. When the source distribution is used `${NUTCH_RUNTIME_HOME}` refers to `apache-nutch-1.X/runtime/local/`. Note that + * config files should be modified in `apache-nutch-1.X/runtime/local/conf/` * `ant clean` will remove this directory (keep copies of modified config files) @@ -63, +68 @@ {{{ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home }}} - On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc: + {{{ export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::") }}} @@ -98, +103 @@ This will include any URL in the domain `nutch.apache.org`. === 3.1 Using the Crawl Command === - {{{#!wiki caution The crawl command is deprecated. Please see section [[#A3.3._Using_the_crawl_script|3.3]] on how to use the crawl script that is intended to replace the crawl command. }}} - Now we are ready to initiate a crawl, use the following parameters: * '''-dir''' ''dir'' names the directory to put the crawl in. @@ -192, +195 @@ {{{ bin/nutch fetch $s1 }}} - Then we parse the entries: {{{ bin/nutch parse $s1 }}} - When this is complete, we update the database with the results of the fetch: {{{ @@ -247, +248 @@ Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>][-params k1=v1&k2=v2...] (<segment> ...| -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize }}} - ==== Step-by-Step: Deleting Duplicates ==== Once indexed the entire contents, it must be disposed of duplicate urls in this way ensures that the urls are unique. @@ -260, +260 @@ Usage: bin/nutch solrdedup <solr url> Example: /bin/nutch solrdedup http://localhost:8983/solr }}} - ==== Step-by-Step: Cleaning Solr ==== The class scans a crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents. Once Solr receives the request the aforementioned documents are duly deleted. This maintains a healthier quality of Solr index. @@ -268, +267 @@ Usage: bin/nutch solrclean <crawldb> <solrurl> Example: /bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr }}} - === 3.3. 
- 
=== 3.3. Using the crawl script ===
- If you have followed the 3.2 section above on how the crawling can be done step by step, you might be wondering how a bash script can be written to automate all the process described above.
- Nutch developers have written one for you :), and it is available at [[bin/crawl]].
+ Nutch developers have written one for you :), and it is available at [[bin/crawl]].
{{{
Usage: bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

@@ -281, +278 @@
Or you can use:
Example: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
}}}
- 
- 
The crawl script has a lot of parameters set, and you can modify the parameters to your needs. It would be ideal to understand the parameters before setting up big crawls.
- 
== 4. Setup Solr for search ==
 * download the binary file from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]

@@ -311, +305 @@
{{{
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
}}}
- 
The call signature for running solrindex has changed. The linkdb is now optional, so you need to denote it with a "-linkdb" flag on the command line. This will send all crawl data to Solr for indexing. For more information please see [[bin/nutch solrindex]].
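Once solrindex has run, a quick way to confirm that documents reached the index is to query Solr directly. A minimal sketch, assuming a default single-core setup at the URL used above (the select path may differ by Solr version and core name):
{{{
# Ask Solr for up to 10 documents matching "nutch", returning url and title.
# Adjust the core path (e.g. /solr/collection1/select) to your installation.
curl "http://localhost:8983/solr/select?q=nutch&fl=url,title&rows=10&wt=json&indent=true"
}}}
If the response lists the pages you crawled, indexing succeeded.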