Hi, I'm trying to get a Nutch crawl to work, but it keeps stopping at depth 1 even though there should be more data to fetch. I can download a list of URLs without any problem using FreeGenerator; it's the recursive crawl that isn't working for me.
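For reference, I'm kicking off the crawl with the all-in-one crawl command, roughly like this (the seed dir, output dir, depth, and topN here are just my local settings, written from memory):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

so I'd expect it to follow links at least a couple of levels past the seeds.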
I have crawl-urlfilter.txt set up to accept any URL, and the plugins configured to use this filter:

  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(crawl|regex)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>

The only other Nutch configs I've changed are the robots settings.

If I inspect the crawldb after a run, I see that it has fetched the 3 seed pages and refused to fetch anything else:

  TOTAL urls: 248
  retry 0:    248
  min score:  0.0090
  avg score:  0.03530645
  max score:  2.029
  status 1 (db_unfetched): 245
  status 2 (db_fetched):   3

How can I get Nutch to fetch the remaining 245 URLs?

Thanks in advance for your help,
Barry

ps: here's my crawl-urlfilter.txt:

  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

  # accept hosts in MY.DOMAIN.NAME
  #+^http://([a-z0-9]*\.)*apache.org/

  # skip everything else
  #-.

  +.*
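pps: for what it's worth, the crawldb stats above are the output of readdb's stats dump, i.e. something along these lines (crawl/crawldb being wherever the crawl directory ended up on my machine):

  bin/nutch readdb crawl/crawldb -stats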

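ppps: in case it helps anyone reproduce this, I believe the filter rules can be sanity-checked by piping a URL through Nutch's URLFilterChecker and looking for a leading + in the output; assuming the version I'm on ships that tool, something like:

  echo "http://www.apache.org/index.html" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined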