Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by AlessioTomasino: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial
------------------------------------------------------------------------------
Please add comments / corrections to this document, because I don't know what the heck I'm doing yet. :) One thing I want to figure out is whether I can inject just a subset of URLs (pages that I know have changed since the last crawl) and refetch/index only those pages. I think there is a way to do this using the -adddays parameter. Does anyone have any insight?

== How to refetch/index a subset of URLs ==

My solution to this common question is to put a filter on the URLs we want to refetch and let those pages expire using the -adddays option of the 'nutch generate' command.

In nutch-site.xml, enable a URL filter plugin such as urlfilter-regex and specify the file that contains the regex filter rules:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed|urlfilter-regex</value>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>

The file regex-urlfilter.txt can contain any regular expressions, including one or more specific URLs we want to refetch/index, e.g.:

+http://myhostname/myurl.html

At this stage we can run "$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments -adddays 31" to generate a segment containing those URLs, and then fetch the new segment with 'bin/nutch fetch'. The fetcher output should look like:

Fetcher: starting
Fetcher: segment: crawl/segments/20080518090826
Fetcher: threads: 50
fetching http://myhostname/myurl.html
redirectCount=0

Any comments/feedback welcome!
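To make the filter-file semantics above concrete, here is a minimal sketch (in Python, not Nutch's actual Java implementation) of how regex-urlfilter.txt rules are evaluated: rules are tried top to bottom, a line starting with '+' accepts a matching URL, a line starting with '-' rejects it, the first matching rule wins, and a URL matching no rule is dropped. The rule list below is an assumed example for this tutorial, not a file shipped with Nutch.

```python
import re

# Example rules, mimicking a regex-urlfilter.txt tuned to refetch one page:
RULES = [
    "+http://myhostname/myurl.html",  # accept exactly the page we want refetched
    "-.",                             # reject everything else
]

def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):   # match anywhere in the URL, first hit wins
            return sign == "+"
    return False                      # no rule matched: URL is filtered out

print(accepts("http://myhostname/myurl.html"))  # True
print(accepts("http://myhostname/other.html"))  # False
```

With a trailing "-." rule, only the explicitly listed URLs pass the filter, so the generator can only select those pages for the new segment.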
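Why -adddays 31? The option effectively shifts "now" forward when the generator decides which pages are due for refetching, so pages whose refetch interval has not yet expired still get selected. The following is an assumed simplification of that selection test, for illustration only (the numbers and the is_due helper are made up for this sketch, not Nutch source):

```python
DAY = 24 * 60 * 60  # seconds in a day

def is_due(last_fetch, fetch_interval, now, adddays=0):
    """A page is due when its next fetch time falls before now + adddays days."""
    next_fetch = last_fetch + fetch_interval
    return next_fetch <= now + adddays * DAY

now = 1_000_000_000
# A page fetched 5 days ago with a 30-day refetch interval:
last = now - 5 * DAY

print(is_due(last, 30 * DAY, now))              # False: 25 days still to go
print(is_due(last, 30 * DAY, now, adddays=31))  # True: -adddays 31 makes it due
```

Choosing 31 days covers the default 30-day refetch interval, so any URL that survives the filter is guaranteed to be picked up regardless of when it was last fetched.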