Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by AlessioTomasino:
http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial

------------------------------------------------------------------------------
  Please add comments / corrections to this document, because I don't know what 
I'm doing yet. :)
  One thing I want to figure out is whether I can inject just a subset of URLs 
for pages that I know have changed since the last crawl, and refetch/index only 
those pages. I think there may be a way to do this using the -adddays 
parameter? Does anyone have any insight?
  
+ == How to refetch/index a subset of urls ==
+ 
+ My solution to this common question is to apply a URL filter matching the 
URLs we want to refetch, and to let those crawldb entries expire using the 
-adddays option of the 'nutch generate' command.
+ In nutch-site.xml you should enable a URL filter plugin such as 
urlfilter-regex and specify the file which contains the regex filter rules:
+ 
+ <property>
+   <name>plugin.includes</name>
+   <value>protocol-http|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed|urlfilter-regex</value>
+ </property>
+ 
+ <property>
+   <name>urlfilter.regex.file</name>
+ 
+   <value>regex-urlfilter.txt</value>
+ </property>
+ 
+ The file regex-urlfilter.txt can contain arbitrary regular expressions, 
including one or more specific URLs we want to refetch/index, e.g.:
+ 
+ +http://myhostname/myurl.html
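+ As a fuller sketch (the hostnames here are examples; urlfilter-regex applies 
rules top-down, first match wins, and URLs matching no rule are excluded), a 
regex-urlfilter.txt that restricts fetching to a couple of known pages could 
look like:

```
# accept only the pages we know have changed (example URLs)
+^http://myhostname/myurl.html$
+^http://myhostname/otherurl.html$
# exclude everything else
-.
```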
+ 
+ At this stage we can run "$NUTCH_HOME/bin/nutch generate crawl/crawldb 
crawl/segments -adddays 31" to generate a segment containing the now-expired 
URLs, and then fetch it with "$NUTCH_HOME/bin/nutch fetch 
crawl/segments/20080518090826" (using the name of the segment just generated). 
The fetcher output should look like:
+ 
+ Fetcher: starting
+ Fetcher: segment: crawl/segments/20080518090826
+ Fetcher: threads: 50
+ fetching http://myhostname/myurl.html
+ redirectCount=0
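+ Putting it together, the whole refetch cycle can be sketched as the command 
sequence below (a sketch, not tested against every Nutch 0.9 setup; it assumes 
it is run from $NUTCH_HOME and that the filter above is already in place):

```shell
#!/bin/sh
# Generate a segment for the expired/filtered URLs.
bin/nutch generate crawl/crawldb crawl/segments -adddays 31

# Pick the newest segment directory (the one just generated).
segment=crawl/segments/$(ls -t crawl/segments | head -1)

# Fetch it, fold the results back into the crawldb, and rebuild
# the link database and index so the refreshed pages are searchable.
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```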
+ 
+ 
+ Any comments/feedback welcome!
+ 
+ 
+ 
