[Nutch Wiki] Update of "NutchTutorial" by JulienNioche

Apache Wiki Tue, 12 Jul 2011 02:49:58 -0700

The "NutchTutorial" page has been changed by JulienNioche:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=34&rev2=35

Comment:
explain that nutch-site.xml should be used instead of nutch-default.xml

  
  Good! You are almost ready to crawl. You need to give your crawler a name. 
This is required.
  
-  1. Open up $NUTCH_HOME/conf/nutch-default.xml file
-  1. Search for {{{http.agent.name}}} , and give it value 'YOURNAME Spider'
-  1. Optionally you may also set {{{http.agent.url}}} and 
{{{http.agent.email}}} properties.
+  1. Edit $NUTCH_HOME/conf/nutch-site.xml (or 
$NUTCH_HOME/runtime/local/conf/nutch-site.xml for version 1.3 and later) and 
add the following property:
+ 
+ {{{
+ <property>
+   <name>http.agent.name</name>
+   <value>YOUR_CRAWLER_NAME_HERE</value>
+ </property>
+ }}}
+ 
+  1. Replace YOUR_CRAWLER_NAME_HERE with the name you want to give your 
crawler.
+  1. Optionally you may also set the {{{http.agent.url}}} and 
{{{http.agent.email}}} properties so that webmasters can identify who is 
crawling their site and contact you if necessary.
+ 
+ '''''Note''''': Set your parameters in nutch-site.xml and leave 
nutch-default.xml unchanged. Use nutch-default.xml only as a reference for the 
list of available parameters and their descriptions.
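  
  As a minimal sketch, a nutch-site.xml with the crawler name and both optional 
  contact properties set might look like the following. The 
  {{{<configuration>}}} root element is assumed to match the layout of 
  nutch-default.xml, and the URL and email values are hypothetical placeholders 
  to replace with your own.
  
  {{{
  <configuration>
    <!-- Required: the name your crawler announces to web servers -->
    <property>
      <name>http.agent.name</name>
      <value>YOUR_CRAWLER_NAME_HERE</value>
    </property>
    <!-- Optional: a page describing your crawler (placeholder URL) -->
    <property>
      <name>http.agent.url</name>
      <value>http://example.com/crawler-info</value>
    </property>
    <!-- Optional: a contact address for webmasters (placeholder address) -->
    <property>
      <name>http.agent.email</name>
      <value>crawler-admin@example.com</value>
    </property>
  </configuration>
  }}}
  
  Values set here should take precedence over the defaults in 
  nutch-default.xml, so that file can stay untouched.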
  
  Now we're ready to crawl. There are two approaches to crawling:
