Hi all, I tried setting up a local filesystem crawl with Nutch 0.9 and am running into problems. Details follow:
------------------------------
CRAWL OUTPUT:

Found 1 items
/user/test/urls  <dir>
crawl started in: crawled
rootUrlDir = urls
threads = 10
depth = 3
topN = 5
Injector: starting
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawled/segments/20071026235539
Generator: filtering: false
Generator: topN: 5
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawled

urls/seed file:

file:///export/home/test/test/tmp
file:///export/home/test/test/search
file:///export/home/test/test/tmp

conf/crawl-urlfilter.txt:

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

## skip file:, ftp:, & mailto: urls
##-^(file|ftp|mailto):

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*com/

# skip everything else for http
#-.*

# take everything else for file
+.*

conf/nutch-site.xml:

<configuration>
  <property>
    <name>plugin.folders</name>
    <value>/export/home/test/test/nutch/build/plugins</value>
    <description>Directories where nutch plugins are located. Each
    element may be a relative or absolute path. If absolute, it is used
    as is. If relative, it is searched for on the classpath.</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin.
    By default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please
    enable protocol-httpclient, but be aware of possible intermittent
    problems with the underlying commons-httpclient library.</description>
  </property>
</configuration>

Any hints on how to proceed further?

Prem
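P.S. For anyone skimming the filter file above: as its comments say, the first '+'/'-' pattern that matches a URL decides its fate, and a URL matching no pattern is ignored. A minimal Python sketch of that first-match semantics (illustrative only, not Nutch's actual Java implementation; the rule list mirrors my config, where file: URLs should fall through to the final `+.*`):

```python
import re

# Illustrative first-match URL filter, mirroring the semantics described in
# crawl-urlfilter.txt: each rule is a ('+' or '-') sign plus a regex; the
# first rule whose regex matches decides, and an unmatched URL is ignored.
RULES = [
    ("-", re.compile(r"^(http|ftp|mailto):")),  # skip http:, ftp:, mailto:
    ("+", re.compile(r".*")),                   # take everything else (incl. file:)
]

def accepts(url: str) -> bool:
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no pattern matched: URL is ignored

print(accepts("file:///export/home/test/test/tmp"))  # True
print(accepts("http://example.com/"))                # False
```

By these rules the file: seeds should survive filtering, which is why the "0 records selected for fetching" result puzzles me.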
