Hi 

I'm trying to get a Nutch crawl to work, and it keeps stopping at depth 1 even
though there should be more data to fetch. I can download a list of URLs
without any problem using FreeGenerator, but the recursive crawl is not
working for me.
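For reference, I'm launching the crawl roughly like this (the directory names
and the depth/topN values here are placeholders, not my exact invocation):

```shell
# Nutch 1.x one-shot crawl: seed URLs in ./urls, output under ./crawl.
# -depth > 1 should make it follow links discovered in earlier rounds.
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```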

I have crawl-urlfilter.txt set up to accept any URL, and the plugins
configured to use this filter:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(crawl|regex)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
</property>

The only other Nutch configs that I've changed are the robots settings.

If I inspect the crawldb after a run I see that it's fetched the 3 seed pages 
and refused to fetch anything else:

TOTAL urls:     248
retry 0:        248
min score:      0.0090
avg score:      0.03530645
max score:      2.029
status 1 (db_unfetched):        245
status 2 (db_fetched):  3
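The stats above come from inspecting the crawldb like this (the crawl
directory name is a placeholder):

```shell
# Dump summary statistics for the crawl database
bin/nutch readdb crawl/crawldb -stats
```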

How can I get Nutch to fetch the rest of the URLs?

thanks in advance for your help,

Barry

PS: here's my crawl-urlfilter.txt:
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
#-.
+.*
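(For anyone checking my logic: as I understand it, the regex URL filter
applies rules top to bottom, the first matching rule decides, and a URL that
matches no rule is rejected. A small Python sketch of that reading, with an
abridged copy of the rules above; this is my mental model, not Nutch code:)

```python
import re

# Abridged rules from my crawl-urlfilter.txt: ('+' accept, '-' reject),
# applied in order, first match wins.
RULES = [
    ("-", r"^(file|ftp|mailto):"),                 # skip non-http schemes
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe)$"),  # binaries
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),         # break crawl loops
    ("+", r".*"),                                   # accept everything else
]

def accepts(url):
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject

print(accepts("http://example.com/page.html"))  # True
print(accepts("ftp://example.com/file"))        # False
print(accepts("http://example.com/logo.gif"))   # False
```

So with the final `+.*` rule, every http URL that survives the skip rules
should be accepted, which is why I expected the unfetched URLs to be crawled.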

 
