Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchTutorial" page has been changed by riverma:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=67&rev2=68

Comment:
Reorganized and fixed confusing text within section 3: crawl your first website

  export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
  }}}
  == 3. Crawl your first website ==
+ Nutch requires two configuration changes before a website can be crawled:
+ 
+  1. Customize your crawl properties, where at a minimum you provide a name for your crawler so that external servers can recognize it
+  1. Set a seed list of URLs to crawl
+ 
+ === 3.1 Customize your crawl properties ===
+  * Default crawl properties can be viewed and edited within `conf/nutch-default.xml` - most of these can be used without modification
+  * The file `conf/nutch-site.xml` serves as a place to add your own custom crawl properties that override `conf/nutch-default.xml`. The only required modification for this file is to override the `value` field of the `http.agent.name` property
-  * Add your agent name in the `value` field of the `http.agent.name` property 
in `conf/nutch-site.xml`, for example:
+   . i.e. Add your agent name in the `value` field of the `http.agent.name` 
property in `conf/nutch-site.xml`, for example:
  
  {{{
  <property>
@@ -83, +91 @@

   <value>My Nutch Spider</value>
  </property>
  }}}
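+ For reference, a minimal but complete `conf/nutch-site.xml` might look like the following sketch (the agent name shown is only an example; use one that identifies your crawler):
+ 
+ {{{
+ <?xml version="1.0"?>
+ <configuration>
+  <property>
+   <name>http.agent.name</name>
+   <value>My Nutch Spider</value>
+  </property>
+ </configuration>
+ }}}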
+ === 3.2 Create a URL seed list ===
+  * A URL seed list is a text file containing a list of websites, one per line, which Nutch will crawl
+  * The file `conf/regex-urlfilter.txt` provides regular expressions that allow Nutch to filter and narrow down the types of web resources to crawl and download
+ 
+ ==== Create the seed list file ====
   * `mkdir -p urls`
   * `cd urls`
   * `touch seed.txt` to create a text file `seed.txt` under `urls/`, then edit it to add the following content (one URL per line for each site you want Nutch to crawl).
@@ -90, +103 @@

  {{{
  http://nutch.apache.org/
  }}}
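+ As a quick sketch, the seed list can also be created from the shell in one step (the URL is just the example used above):
+ 
+ {{{
+ mkdir -p urls
+ echo "http://nutch.apache.org/" > urls/seed.txt
+ }}}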
+ ==== (Optional) Configure Regular Expression Filters ====
-  * Edit the file `conf/regex-urlfilter.txt` and replace
+ Edit the file `conf/regex-urlfilter.txt` and replace
  
  {{{
  # accept anything else
@@ -103, +117 @@

  }}}
  This will include any URL in the domain `nutch.apache.org`.
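+ As an illustrative sketch only, a line that restricts the crawl to a single domain typically looks something like the following (the leading `+` means the URL is accepted; adjust the domain to the site you want to crawl):
+ 
+ {{{
+ +^http://([a-z0-9]*\.)*nutch.apache.org/
+ }}}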
  
+ NOTE: If you do not specify any domains to include within `regex-urlfilter.txt`, all domains linked from your seed URLs will be crawled as well.
+ 
- === 3.1 Using the Crawl Command ===
+ === 3.3 Using the Crawl Command ===
  {{{#!wiki caution
- The crawl command is deprecated. Please see section 
[[#A3.3._Using_the_crawl_script|3.3]] on how to use the crawl script that is 
intended to replace the crawl command.
+ The crawl command is deprecated. Please see section [[#A3.5._Using_the_crawl_script|3.5]] on how to use the crawl script that is intended to replace the crawl command.
  }}}
  Now we are ready to initiate a crawl; use the following parameters:
  
@@ -134, +150 @@

  
  Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (`-topN`), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (`-topN`) for a full crawl can range from tens of thousands to millions, depending on your resources.
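+ For instance, a shallow test run with the (deprecated) crawl command might look something like this; the parameter values are only illustrative:
+ 
+ {{{
+ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
+ }}}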
  
- === 3.2 Using Individual Commands for Whole-Web Crawling ===
+ === 3.4 Using Individual Commands for Whole-Web Crawling ===
  '''NOTE''': If you previously modified the file `conf/regex-urlfilter.txt` as covered [[#A3._Crawl_your_first_website|here]], you will need to change it back.
  
  Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines.  This also permits more control over the crawl process, and incremental crawling.  It is important to note that whole Web crawling does not necessarily mean crawling the entire World Wide Web.  We can limit a whole Web crawl to just a list of the URLs we want to crawl.  This is done by using a filter just like the one we used when we ran the `crawl` command (above).
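+ As a rough sketch of the kind of individual commands this section covers (paths and the `-topN` value are only illustrative), the first steps look something like:
+ 
+ {{{
+ bin/nutch inject crawl/crawldb urls
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ }}}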
@@ -268, +284 @@

       Usage: bin/nutch solrclean <crawldb> <solrurl>
       Example: /bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
  }}}
- === 3.3. Using the crawl script ===
+ === 3.5. Using the crawl script ===
  If you have followed section 3.4 above on how crawling can be done step by step, you might be wondering how a bash script can be written to automate the whole process described above.
  
  Nutch developers have written one for you :), and it is available at 
[[bin/crawl]].
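+ The exact arguments the crawl script accepts vary between Nutch releases; running `bin/crawl` with no arguments should print the usage for your version. As a hedged sketch, assuming a version whose script takes a seed directory, a crawl directory, a Solr URL and a number of crawl rounds, an invocation could look like:
+ 
+ {{{
+ bin/crawl urls crawl http://localhost:8983/solr/ 2
+ }}}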
