Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=80&rev2=81 ## page was renamed from Running Nutch 1.3 with Solr Integration ## page was renamed from RunningNutchAndSolr ## Lang: En - == Introduction == + = Introduction = Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables fine-grained configuration, relying on Apache Hadoop data structures, which are great for batch processing. Being pluggable and modular of course has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, SolrCloud, etc. Nutch can discover Web page hyperlinks in an automated manner, reducing a lot of maintenance work (for example, checking for broken links), and create a copy of all the visited pages for searching over. This tutorial explains how to use Nutch with Apache Solr. Solr is an open source full-text search framework; with Solr we can search the pages visited by Nutch. Luckily, integration between Nutch and Solr is pretty straightforward. Apache Nutch supports Solr out of the box, greatly simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a binary release from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. - == Learning Outcomes == + = Learning Outcomes = By the end of this tutorial you will * Have a local Nutch crawler set up and configured to crawl on one machine * Understand how to configure the Nutch runtime configuration, including seed URL lists, URLFilters, etc. @@ -20, +20 @@ Any issues with this tutorial should be reported to the [[http://nutch.apache.org/mailing_lists.html|Nutch user@]] list. 
- == Table of Contents == + = Table of Contents = <<TableOfContents(3)>> - == Steps == + = Steps = {{{#!wiki caution This tutorial describes the installation and use of Nutch 1.x (current release is 1.9). For instructions on how to compile and set up Nutch 2.x with HBase, see Nutch2Tutorial. }}} - == Requirements == + = Requirements = * Unix environment, or Windows-[[https://www.cygwin.com/|Cygwin]] environment * Java Runtime/Development Environment (1.7) * (Source build only) Apache Ant: http://ant.apache.org/ - == Install Nutch == + = Install Nutch = - === Option 1: Setup Nutch from a binary distribution === + == Option 1: Setup Nutch from a binary distribution == * Download a binary package (`apache-nutch-1.X-bin.zip`) from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. * Unzip your binary Nutch package. There should be a folder `apache-nutch-1.X`. * `cd apache-nutch-1.X/` From now on, we are going to use `${NUTCH_RUNTIME_HOME}` to refer to the current directory (`apache-nutch-1.X/`). - === Option 2: Set up Nutch from a source distribution === + == Option 2: Set up Nutch from a source distribution == Advanced users may also use the source distribution: * Download a source package (`apache-nutch-1.X-src.zip`) @@ -54, +54 @@ * config files should be modified in `apache-nutch-1.X/runtime/local/conf/` * `ant clean` will remove this directory (keep copies of modified config files) - == Verify your Nutch installation == + = Verify your Nutch installation = * run "`bin/nutch`" - You can confirm a correct installation if you see output similar to the following: {{{ Usage: nutch COMMAND where command is one of: - crawl one-step crawler for intranets (DEPRECATED) readdb read / dump crawl db mergedb merge crawldb-s, with optional filtering readlinkdb read / dump link db @@ -104, +103 @@ Note that the `LMC-032857` above should be replaced with your machine name. 
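As a quick sanity check before running "`bin/nutch`", the environment can be probed with a few shell commands. This is a minimal illustrative sketch, not part of the official tutorial; `NUTCH_RUNTIME_HOME` defaulting to the current directory is an assumption, so adjust it to wherever you unpacked Nutch.

```shell
# Illustrative sanity check (assumption: NUTCH_RUNTIME_HOME defaults to $PWD).
NUTCH_RUNTIME_HOME=${NUTCH_RUNTIME_HOME:-$PWD}

# Is a JVM on the PATH? bin/nutch needs Java (1.7) and a sane JAVA_HOME.
if command -v java >/dev/null 2>&1; then JAVA_OK=yes; else JAVA_OK=no; fi
echo "java on PATH: $JAVA_OK"

# Is the nutch launcher present and executable?
if [ -x "$NUTCH_RUNTIME_HOME/bin/nutch" ]; then NUTCH_OK=yes; else NUTCH_OK=no; fi
echo "bin/nutch executable: $NUTCH_OK (if no, try: chmod +x bin/nutch)"
```

If either check prints `no`, fix that before continuing (install a JDK and set `JAVA_HOME`, or `chmod +x bin/nutch`).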
- == Crawl your first website == + = Crawl your first website = Nutch requires two configuration changes before a website can be crawled: 1. Customize your crawl properties, where at a minimum, you provide a name for your crawler for external servers to recognize 1. Set a seed list of URLs to crawl - === Customize your crawl properties === + == Customize your crawl properties == * Default crawl properties can be viewed and edited within `conf/nutch-default.xml`, where most of these can be used without modification * The file `conf/nutch-site.xml` serves as a place to add your own custom crawl properties that overwrite `conf/nutch-default.xml`. The only required modification for this file is to override the `value` field of the `http.agent.name` property, i.e. add your agent name in the `value` field of the `http.agent.name` property in `conf/nutch-site.xml`, for example: @@ -121, +120 @@ <value>My Nutch Spider</value> </property> }}} - === Create a URL seed list === + == Create a URL seed list == * A URL seed list includes a list of websites, one per line, which Nutch will look to crawl * The file `conf/regex-urlfilter.txt` provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download - ==== Create a URL seed list ==== + === Create a URL seed list === * `mkdir -p urls` * `cd urls` * `touch seed.txt` to create a text file `seed.txt` under `urls/` with the following content (one URL per line for each site you want Nutch to crawl). @@ -133, +132 @@ {{{ http://nutch.apache.org/ }}} - ==== (Optional) Configure Regular Expression Filters ==== + === (Optional) Configure Regular Expression Filters === Edit the file `conf/regex-urlfilter.txt` and replace {{{ @@ -149, +148 @@ NOTE: Not specifying any domains to include within `regex-urlfilter.txt` will lead to all domains linked from your seed URLs being crawled as well. 
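The seed-list steps above can be collected into one runnable sketch. The URL is the tutorial's example seed; replace it with the sites you actually want Nutch to crawl.

```shell
# Create the seed list exactly as the steps above describe:
# a urls/ directory containing seed.txt with one URL per line.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://nutch.apache.org/
EOF

# Show what was written (one URL per line).
cat urls/seed.txt
```

Run this from `${NUTCH_RUNTIME_HOME}` so the later `bin/nutch inject crawl/crawldb urls` command finds the directory.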
- === Using Individual Commands for Whole-Web Crawling === + == Using Individual Commands for Whole-Web Crawling == '''NOTE''': If you previously modified the file `conf/regex-urlfilter.txt` as covered [[#A3._Crawl_your_first_website|here]] you will need to change it back. Whole-Web crawling is designed to handle very large crawls that may take weeks to complete, running on multiple machines. This also permits more control over the crawl process, as well as incremental crawling. It is important to note that whole-Web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole-Web crawl to just a list of the URLs we want to crawl. This is done by using a filter just like the one we used when we did the `crawl` command (above). - ==== Step-by-Step: Concepts ==== + === Step-by-Step: Concepts === Nutch data is composed of: 1. The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when. @@ -167, +166 @@ * a ''parse_data'' contains outlinks and metadata parsed from each URL * a ''crawl_parse'' contains the outlink URLs, used to update the crawldb - ==== Step-by-Step: Seeding the crawldb with a list of URLs ==== + === Step-by-Step: Seeding the crawldb with a list of URLs === - ===== Option 1: Bootstrapping from the DMOZ database. ===== + ==== Option 1: Bootstrapping from the DMOZ database. ==== The injector adds URLs to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+ MB file, so this will take a few minutes.) {{{ @@ -188, +187 @@ }}} Now we have a Web database with around 1,000 as-yet unfetched URLs in it. - ===== Option 2. Bootstrapping from an initial seed list. ===== + ==== Option 2. Bootstrapping from an initial seed list. ==== This option mirrors the creation of the seed list as covered [[#A3._Crawl_your_first_website|here]]. 
{{{ bin/nutch inject crawl/crawldb urls }}} - ==== Step-by-Step: Fetching ==== + === Step-by-Step: Fetching === To fetch, we first generate a fetch list from the database: {{{ @@ -247, +246 @@ }}} By this point we've fetched a few thousand pages. Let's invert links and index them! - ==== Step-by-Step: Invertlinks ==== + === Step-by-Step: Invertlinks === Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages. {{{ @@ -255, +254 @@ }}} We are now ready to search with Apache Solr. - ==== Step-by-Step: Indexing into Apache Solr ==== + === Step-by-Step: Indexing into Apache Solr === Note: For this step you need a working Solr installation. If you have not yet integrated Nutch with Solr, you should first read [[#A4._Setup_Solr_for_search|here]]. Now we are ready to go on and index all the resources. For more information see [[http://wiki.apache.org/nutch/bin/nutch%20solrindex|this page]] @@ -264, +263 @@ Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>][-params k1=v1&k2=v2...] (<segment> ...| -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize }}} - ==== Step-by-Step: Deleting Duplicates ==== + === Step-by-Step: Deleting Duplicates === Once the entire contents have been indexed, duplicate URLs must be removed; this ensures that the URLs in the index are unique. MapReduce: @@ -276, +275 @@ Usage: bin/nutch solrdedup <solr url> Example: /bin/nutch solrdedup http://localhost:8983/solr }}} - ==== Step-by-Step: Cleaning Solr ==== + === Step-by-Step: Cleaning Solr === The class scans a crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents. Once Solr receives the request the aforementioned documents are duly deleted. This maintains a healthier quality of the Solr index. 
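The individual inject/generate/fetch/parse/updatedb commands fit together into a simple loop. The sketch below is illustrative only, not the maintained `bin/crawl` script: it defaults to a dry run that merely echoes the `bin/nutch` invocations so the sequence can be inspected without a Nutch installation. Set `NUTCH=bin/nutch` (and run from `${NUTCH_RUNTIME_HOME}`) to execute it for real.

```shell
# Dry-run sketch of the whole-Web crawl cycle. NUTCH defaults to "echo bin/nutch"
# so nothing is actually crawled; override with NUTCH=bin/nutch to run it.
NUTCH=${NUTCH:-"echo bin/nutch"}
ROUNDS=${ROUNDS:-2}

$NUTCH inject crawl/crawldb urls            # seed the crawldb with the URL list

round=1
while [ "$round" -le "$ROUNDS" ]; do
    $NUTCH generate crawl/crawldb crawl/segments -topN 1000
    # Pick the newest segment directory; fall back to a placeholder in dry-run mode.
    seg=$(ls -d crawl/segments/* 2>/dev/null | tail -1)
    seg=${seg:-crawl/segments/SEGMENT}
    $NUTCH fetch "$seg"
    $NUTCH parse "$seg"
    $NUTCH updatedb crawl/crawldb "$seg"    # fold fetched/parsed results back in
    round=$((round + 1))
done

$NUTCH invertlinks crawl/linkdb -dir crawl/segments   # ready for Solr indexing
```

The `-topN 1000` value is just an example fetch-list size; tune it and `ROUNDS` to the scale of your crawl.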
{{{ Usage: bin/nutch solrclean <crawldb> <solrurl> Example: /bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr }}} - === Using the crawl script === + == Using the crawl script == If you have followed the section above on how the crawling can be done step by step, you might be wondering how a bash script can be written to automate the whole process described above. Nutch developers have written one for you :), and it is available at [[bin/crawl]]. {{{ - Usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds> - Example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2 + Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds> + -i|--index Indexes crawl results into a configured indexer + -D A Java property to pass to Nutch calls + Seed Dir Directory in which to look for a seeds file + Crawl Dir Directory where the crawl/link/segments dirs are saved + Num Rounds The number of rounds to run this crawl for + Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2 }}} The crawl script has a lot of parameters set, and you can modify the parameters to suit your needs. It would be ideal to understand the parameters before setting up big crawls. - == Setup Solr for search == + = Setup Solr for search = * Download the binary release from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]] * Unzip it to `$HOME/apache-solr`; we will now refer to this as `${APACHE_SOLR_HOME}` * `cd ${APACHE_SOLR_HOME}/example` * `java -jar start.jar` - == Verify Solr installation == + = Verify Solr installation = After you have started Solr, you should be able to access the admin console at the following link: {{{ http://localhost:8983/solr/#/ }}} - == Integrate Solr with Nutch == + = Integrate Solr with Nutch = We now have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). 
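Before delegating searching to Solr, it can help to confirm from the command line that Solr is actually answering. This is a hedged sketch: the URL assumes the default example port 8983 and the `admin/ping` handler, so adjust `SOLR_URL` (and the ping path, which varies between Solr versions) to your installation.

```shell
# Assumption: Solr's example server is listening on the default port 8983.
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr}

# -f makes curl fail on HTTP errors; errors are suppressed so the check
# degrades gracefully when Solr (or curl) is not available.
if curl -s -f -o /dev/null "$SOLR_URL/admin/ping" 2>/dev/null; then
    SOLR_UP=yes
else
    SOLR_UP=no
fi
echo "Solr reachable at $SOLR_URL: $SOLR_UP"
```

If this prints `no`, start Solr (`java -jar start.jar` from `${APACHE_SOLR_HOME}/example`) before continuing with the integration steps.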
Below are the steps to delegate searching to Solr for links to be searchable: * Backup the original Solr example schema.xml:<<BR>> @@ -360, +364 @@ If all has gone to plan, you are now ready to search with http://localhost:8983/solr/admin/. - == Whats Next == + = What's Next = You may want to check out the documentation for the [[https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI|Nutch 1.X REST API]] to get an overview of the work going on towards providing Apache CXF based REST services for the Nutch 1.X branch.