Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=77&rev2=78

 * Set `JAVA_HOME` if you see an error that `JAVA_HOME` is not set. On Mac, you can run the following command or add it to `~/.bashrc`:

{{{
- export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
+ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.7/Home
+ # note that the actual path may be different on your system
}}}

On Debian or Ubuntu, you can run the following command or add it to `~/.bashrc`:

{{{
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
}}}

+
+ You may also have to update your /etc/hosts file. If so, you can add the following:
+
+ {{{
+ ##
+ # Host Database
+ #
+ # localhost is used to configure the loopback interface
+ # when the system is booting. Do not change this entry.
+ ##
+ 127.0.0.1       localhost.localdomain localhost LMC-032857
+ ::1             ip6-localhost ip6-loopback
+ fe80::1%lo0     ip6-localhost ip6-loopback
+ }}}
+
+ Note that `LMC-032857` above should be replaced with your machine name.
+

== 3. Crawl your first website ==

Nutch requires two configuration changes before a website can be crawled:

@@ -120, +138 @@

NOTE: Not specifying any domains to include within regex-urlfilter.txt will lead to all domains linked from your seed URLs being crawled as well.

- === 3.3 Using the Crawl Command ===
- {{{#!wiki caution
- The crawl command is deprecated. Please see section [[#a3.5._Using_the_crawl_script|3.5]] on how to use the crawl script that is intended to replace the crawl command.
- }}}
- Now we are ready to initiate a crawl; use the following parameters:
-
-  * '''-dir''' ''dir'' names the directory to put the crawl in.
-  * '''-threads''' ''threads'' determines the number of threads that will fetch in parallel.
-  * '''-depth''' ''depth'' indicates the link depth from the root page that should be crawled.
-  * '''-topN''' ''N'' determines the maximum number of pages that will be retrieved at each level, up to the depth.
-  * Run the following command:
-
- {{{
- bin/nutch crawl urls -dir crawl -depth 3 -topN 5
- }}}
-  * Now you should be able to see the following directories created:
-
- {{{
- crawl/crawldb
- crawl/linkdb
- crawl/segments
- }}}
- '''NOTE''': If you have a Solr core already set up and wish to index to it, you are required to add the `-solr <solrUrl>` parameter to your `crawl` command, e.g.
-
- {{{
- bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
- }}}
- If not, please skip to [[#A4._Setup_Solr_for_search|here]] for how to set up your Solr instance and index your crawl data.
-
- Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (`-topN`), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (`-topN`) for a full crawl can range from tens of thousands to millions, depending on your resources.
-
- === 3.4 Using Individual Commands for Whole-Web Crawling ===
+ === Using Individual Commands for Whole-Web Crawling ===

'''NOTE''': If you previously modified the file `conf/regex-urlfilter.txt` as covered [[#A3._Crawl_your_first_website|here]], you will need to change it back.

Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines. This also permits more control over the crawl process, and incremental crawling. It is important to note that whole-Web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole-Web crawl to just a list of the URLs we want to crawl. This is done by using a filter just like the one we used when we ran the `crawl` command (above).
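
The URL list mentioned above is just a plain-text file of seed URLs, one per line. As a minimal sketch (the directory name `urls`, the file name `seed.txt`, and the URL itself are illustrative; substitute your own seeds):

{{{
# Create a seed list for a limited whole-Web crawl.
# The directory, file name, and URL below are illustrative.
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
cat urls/seed.txt
}}}

A seed directory like this is what the individual-commands workflow starts from, typically by injecting it into the crawldb with `bin/nutch inject`.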