Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by JulienNioche:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=34&rev2=35

Comment:
explain that nutch-site.xml should be used instead of nutch-default.xml

  Good! You are almost ready to crawl. You need to give your crawler a name. This is required.

-  1. Open up $NUTCH_HOME/conf/nutch-default.xml file
-  1. Search for {{{http.agent.name}}} , and give it value 'YOURNAME Spider'
-  1. Optionally you may also set {{{http.agent.url}}} and {{{http.agent.email}}} properties.
+  1. Edit $NUTCH_HOME/conf/nutch-site.xml (or $NUTCH_HOME/runtime/local/conf/nutch-site.xml with version >= 1.3) and add
+
+ {{{
+ <property>
+ <name>http.agent.name</name>
+ <value>YOUR_CRAWLER_NAME_HERE</value>
+ </property>
+ }}}
+
+  1. Replace YOUR_CRAWLER_NAME_HERE with the name you want to give to your crawler
+  1. Optionally you may also set the {{{http.agent.url}}} and {{{http.agent.email}}} properties so that webmasters can identify who is crawling their site and contact you if necessary.
+
+ '''''Note''''' : It is advised to specify your parameters in the file nutch-site.xml and leave nutch-default.xml as it is. The latter should be used as a reference only for checking the list of available parameters and their descriptions.

  Now we're ready to crawl. There are two approaches to crawling: