Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=80&rev2=81 ## page was renamed from Running Nutch 1.3 with Solr Integration ## page was renamed from RunningNutchAndSolr ## Lang: En - == Introduction == + = Introduction = Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables fine-grained configuration, relying on Apache Hadoop data structures, which are great for batch processing. Being pluggable and modular of course has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, SolrCloud, etc. Nutch can discover Web page hyperlinks in an automated manner, reducing a lot of maintenance work (for example, checking for broken links), and create a copy of all the visited pages for searching over. This tutorial explains how to use Nutch with Apache Solr. Solr is an open source full-text search framework; with Solr we can search the pages visited by Nutch. Luckily, integration between Nutch and Solr is pretty straightforward. Apache Nutch supports Solr out of the box, greatly simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a binary release from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. - == Learning Outcomes == + = Learning Outcomes = By the end of this tutorial you will * Have a local Nutch crawler set up and configured to crawl on one machine * Understand how to configure the Nutch runtime configuration, including seed URL lists, URLFilters, etc. @@ -20, +20 @@ Any issues with this tutorial should be reported to the [[http://nutch.apache.org/mailing_lists.html|Nutch user@]] list. 
- == Table of Contents == + = Table of Contents = <<TableOfContents(3)>> - == Steps == + = Steps = {{{#!wiki caution This tutorial describes the installation and use of Nutch 1.x (current release is 1.9). For instructions on how to compile and set up Nutch 2.x with HBase, see Nutch2Tutorial. }}} - == Requirements == + = Requirements = * Unix environment, or Windows-[[https://www.cygwin.com/|Cygwin]] environment * Java Runtime/Development Environment (1.7) * (Source build only) Apache Ant: http://ant.apache.org/ - == Install Nutch == + = Install Nutch = - === Option 1: Setup Nutch from a binary distribution === + == Option 1: Setup Nutch from a binary distribution == * Download a binary package (`apache-nutch-1.X-bin.zip`) from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. * Unzip your binary Nutch package. There should be a folder `apache-nutch-1.X`. * `cd apache-nutch-1.X/` From now on, we are going to use `${NUTCH_RUNTIME_HOME}` to refer to the current directory (`apache-nutch-1.X/`). - === Option 2: Set up Nutch from a source distribution === + == Option 2: Set up Nutch from a source distribution == Advanced users may also use the source distribution: * Download a source package (`apache-nutch-1.X-src.zip`) @@ -54, +54 @@ * config files should be modified in `apache-nutch-1.X/runtime/local/conf/` * `ant clean` will remove this directory (keep copies of modified config files) - == Verify your Nutch installation == + = Verify your Nutch installation = * run "`bin/nutch`" - You can confirm a correct installation if you see output similar to the following: {{{ Usage: nutch COMMAND where command is one of: - crawl one-step crawler for intranets (DEPRECATED) readdb read / dump crawl db mergedb merge crawldb-s, with optional filtering readlinkdb read / dump link db @@ -104, +103 @@ Note that the `LMC-032857` above should be replaced with your machine name. 
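As a quick sanity check before running "`bin/nutch`", the environment can be probed with a few shell commands. This is a minimal illustrative sketch, not part of the official tutorial; `NUTCH_RUNTIME_HOME` defaulting to the current directory is an assumption, so adjust it to wherever you unpacked Nutch.

```shell
# Illustrative sanity check (assumption: NUTCH_RUNTIME_HOME defaults to $PWD).
NUTCH_RUNTIME_HOME=${NUTCH_RUNTIME_HOME:-$PWD}

# Is a JVM on the PATH? bin/nutch needs Java (1.7) and a sane JAVA_HOME.
if command -v java >/dev/null 2>&1; then JAVA_OK=yes; else JAVA_OK=no; fi
echo "java on PATH: $JAVA_OK"

# Is the nutch launcher present and executable?
if [ -x "$NUTCH_RUNTIME_HOME/bin/nutch" ]; then NUTCH_OK=yes; else NUTCH_OK=no; fi
echo "bin/nutch executable: $NUTCH_OK (if no, try: chmod +x bin/nutch)"
```

If either check prints `no`, fix that before continuing (install a JDK and set `JAVA_HOME`, or `chmod +x bin/nutch`).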
- == Crawl your first website == + = Crawl your first website = Nutch requires two configuration changes before a website can be crawled: 1. Customize your crawl properties, where at a minimum, you provide a name for your crawler for external servers to recognize 1. Set a seed list of URLs to crawl - === Customize your crawl properties === + == Customize your crawl properties == * Default crawl properties can be viewed and edited within `conf/nutch-default.xml`, where most of these can be used without modification * The file `conf/nutch-site.xml` serves as a place to add your own custom crawl properties that overwrite `conf/nutch-default.xml`. The only required modification for this file is to override the `value` field of the `http.agent.name` property, i.e. add your agent name in the `value` field of the `http.agent.name` property in `conf/nutch-site.xml`, for example: @@ -121, +120 @@ <value>My Nutch Spider</value> </property> }}} - === Create a URL seed list === + == Create a URL seed list == * A URL seed list includes a list of websites, one per line, which Nutch will look to crawl * The file `conf/regex-urlfilter.txt` provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download - ==== Create a URL seed list ==== + === Create a URL seed list === * `mkdir -p urls` * `cd urls` * `touch seed.txt` to create a text file `seed.txt` under `urls/` with the following content (one URL per line for each site you want Nutch to crawl). @@ -133, +132 @@ {{{ http://nutch.apache.org/ }}} - ==== (Optional) Configure Regular Expression Filters ==== + === (Optional) Configure Regular Expression Filters === Edit the file `conf/regex-urlfilter.txt` and replace {{{ @@ -149, +148 @@ NOTE: Not specifying any domains to include within `regex-urlfilter.txt` will lead to all domains linked from your seed URLs being crawled as well. 
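The seed-list steps above can be collected into one runnable sketch. The URL is the tutorial's example seed; replace it with the sites you actually want Nutch to crawl.

```shell
# Create the seed list exactly as the steps above describe:
# a urls/ directory containing seed.txt with one URL per line.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://nutch.apache.org/
EOF

# Show what was written (one URL per line).
cat urls/seed.txt
```

Run this from `${NUTCH_RUNTIME_HOME}` so the later `bin/nutch inject crawl/crawldb urls` command finds the directory.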
- === Using Individual Commands for Whole-Web Crawling === + == Using Individual Commands for Whole-Web Crawling == '''NOTE''': If you previously modified the file `conf/regex-urlfilter.txt` as covered [[#A3._Crawl_your_first_website|here]] you will need to change it back. Whole-Web crawling is designed to handle very large crawls that may take weeks to complete, running on multiple machines. This also permits more control over the crawl process, as well as incremental crawling. It is important to note that whole-Web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole-Web crawl to just a list of the URLs we want to crawl. This is done by using a filter just like the one we used when we did the `crawl` command (above). - ==== Step-by-Step: Concepts ==== + === Step-by-Step: Concepts === Nutch data is composed of: 1. The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when. @@ -167, +166 @@ * a ''parse_data'' contains outlinks and metadata parsed from each URL * a ''crawl_parse'' contains the outlink URLs, used to update the crawldb - ==== Step-by-Step: Seeding the crawldb with a list of URLs ==== + === Step-by-Step: Seeding the crawldb with a list of URLs === - ===== Option 1: Bootstrapping from the DMOZ database. ===== + ==== Option 1: Bootstrapping from the DMOZ database. ==== The injector adds URLs to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+ MB file, so this will take a few minutes.) {{{ @@ -188, +187 @@ }}} Now we have a Web database with around 1,000 as-yet unfetched URLs in it. - ===== Option 2. Bootstrapping from an initial seed list. ===== + ==== Option 2. Bootstrapping from an initial seed list. ==== This option mirrors the creation of the seed list as covered [[#A3._Crawl_your_first_website|here]]. 
{{{ bin/nutch inject crawl/crawldb urls }}} - ==== Step-by-Step: Fetching ==== + === Step-by-Step: Fetching === To fetch, we first generate a fetch list from the database: {{{ @@ -247, +246 @@ }}} By this point we've fetched a few thousand pages. Let's invert links and index them! - ==== Step-by-Step: Invertlinks ==== + === Step-by-Step: Invertlinks === Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages. {{{ @@ -255, +254 @@ }}} We are now ready to search with Apache Solr. - ==== Step-by-Step: Indexing into Apache Solr ==== + === Step-by-Step: Indexing into Apache Solr === Note: For this step you need a working Solr installation. If you have not yet integrated Nutch with Solr, you should first read [[#A4._Setup_Solr_for_search|here]]. Now we are ready to go on and index all the resources. For more information see [[http://wiki.apache.org/nutch/bin/nutch%20solrindex|this page]] @@ -264, +263 @@ Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>][-params k1=v1&k2=v2...] (<segment> ...| -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize }}} - ==== Step-by-Step: Deleting Duplicates ==== + === Step-by-Step: Deleting Duplicates === Once the entire contents have been indexed, duplicate URLs must be removed; this ensures that the URLs in the index are unique. MapReduce: @@ -276, +275 @@ Usage: bin/nutch solrdedup <solr url> Example: /bin/nutch solrdedup http://localhost:8983/solr }}} - ==== Step-by-Step: Cleaning Solr ==== + === Step-by-Step: Cleaning Solr === The class scans a crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents. Once Solr receives the request the aforementioned documents are duly deleted. This maintains a healthier quality of the Solr index. 
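The individual inject/generate/fetch/parse/updatedb commands fit together into a simple loop. The sketch below is illustrative only, not the maintained `bin/crawl` script: it defaults to a dry run that merely echoes the `bin/nutch` invocations so the sequence can be inspected without a Nutch installation. Set `NUTCH=bin/nutch` (and run from `${NUTCH_RUNTIME_HOME}`) to execute it for real.

```shell
# Dry-run sketch of the whole-Web crawl cycle. NUTCH defaults to "echo bin/nutch"
# so nothing is actually crawled; override with NUTCH=bin/nutch to run it.
NUTCH=${NUTCH:-"echo bin/nutch"}
ROUNDS=${ROUNDS:-2}

$NUTCH inject crawl/crawldb urls            # seed the crawldb with the URL list

round=1
while [ "$round" -le "$ROUNDS" ]; do
    $NUTCH generate crawl/crawldb crawl/segments -topN 1000
    # Pick the newest segment directory; fall back to a placeholder in dry-run mode.
    seg=$(ls -d crawl/segments/* 2>/dev/null | tail -1)
    seg=${seg:-crawl/segments/SEGMENT}
    $NUTCH fetch "$seg"
    $NUTCH parse "$seg"
    $NUTCH updatedb crawl/crawldb "$seg"    # fold fetched/parsed results back in
    round=$((round + 1))
done

$NUTCH invertlinks crawl/linkdb -dir crawl/segments   # ready for Solr indexing
```

The `-topN 1000` value is just an example fetch-list size; tune it and `ROUNDS` to the scale of your crawl.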
{{{ Usage: bin/nutch solrclean <crawldb> <solrurl> Example: /bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr }}} - === Using the crawl script === + == Using the crawl script == If you have followed the section above on how the crawling can be done step by step, you might be wondering how a bash script can be written to automate the whole process described above. Nutch developers have written one for you :), and it is available at [[bin/crawl]]. {{{ - Usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds> - Example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2 + Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds> + -i|--index Indexes crawl results into a configured indexer + -D A Java property to pass to Nutch calls + Seed Dir Directory in which to look for a seeds file + Crawl Dir Directory where the crawl/link/segments dirs are saved + Num Rounds The number of rounds to run this crawl for + Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2 }}} The crawl script has a lot of parameters set, and you can modify the parameters to suit your needs. It would be ideal to understand the parameters before setting up big crawls. - == Setup Solr for search == + = Setup Solr for search = * Download the binary release from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]] * Unzip it to `$HOME/apache-solr`; we will now refer to this as `${APACHE_SOLR_HOME}` * `cd ${APACHE_SOLR_HOME}/example` * `java -jar start.jar` - == Verify Solr installation == + = Verify Solr installation = After you have started Solr, you should be able to access the admin console at the following link: {{{ http://localhost:8983/solr/#/ }}} - == Integrate Solr with Nutch == + = Integrate Solr with Nutch = We now have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). 
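Before delegating searching to Solr, it can help to confirm from the command line that Solr is actually answering. This is a hedged sketch: the URL assumes the default example port 8983 and the `admin/ping` handler, so adjust `SOLR_URL` (and the ping path, which varies between Solr versions) to your installation.

```shell
# Assumption: Solr's example server is listening on the default port 8983.
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr}

# -f makes curl fail on HTTP errors; errors are suppressed so the check
# degrades gracefully when Solr (or curl) is not available.
if curl -s -f -o /dev/null "$SOLR_URL/admin/ping" 2>/dev/null; then
    SOLR_UP=yes
else
    SOLR_UP=no
fi
echo "Solr reachable at $SOLR_URL: $SOLR_UP"
```

If this prints `no`, start Solr (`java -jar start.jar` from `${APACHE_SOLR_HOME}/example`) before continuing with the integration steps.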
Below are the steps to delegate searching to Solr for links to be searchable: * Backup the original Solr example schema.xml:<<BR>> @@ -360, +364 @@ If all has gone to plan, you are now ready to search with http://localhost:8983/solr/admin/. - == Whats Next == + = What's Next = You may want to check out the documentation for the [[https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI|Nutch 1.X REST API]] to get an overview of the work going on towards providing Apache CXF based REST services for the Nutch 1.X branch.