Hi, I’m sorry to bother you with these questions:
1. I can't understand why Google lists thousands (>5000) of relevant pages (searching e.g. "site:http://domain.com/region/topic/") while Nutch only finds a few hundred. I set db.max.outlinks.per.page to -1, set db.ignore.external.links to true, and commented out the skip line (#-[...@=]). I start with

bin/nutch crawl urls -dir mydir -depth 4 >& crawl.log

and get:

----------------------
TOTAL urls: 979
retry 0:    971
retry 1:    4
retry 2:    4
min score:  0.0
avg score:  0.4908437
max score:  470.039
status 1 (db_unfetched):  212
status 2 (db_fetched):    279
status 3 (db_gone):       8
status 4 (db_redir_temp): 472
status 5 (db_redir_perm): 8
CrawlDb statistics: done
----------------------

2. Most of the fetched URLs are from parent pages, e.g. http://domain.com/region/, not from http://domain.com/region/topic/, even though I set the pattern +^http://([a-z0-9]*\.)*domain.com/region/topic/ in urls.txt and crawl-urlfilter.txt.

3. What does db_redir_temp: 472 mean?

I would really appreciate any support.

Nutchnoob
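For reference, the configuration described above would look roughly like the following. This is a sketch, not the poster's exact files: the property names (db.ignore.external.links, db.max.outlinks.per.page) are the standard Nutch 1.x ones, domain.com is the placeholder from the question, and the catch-all reject line is the usual last line of a Nutch URL filter file.

```xml
<!-- conf/nutch-site.xml: the two properties mentioned above
     (standard Nutch 1.x property names) -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
```

And the URL filter (note that crawl-urlfilter.txt holds the regex rules, while the urls directory normally holds only the seed URLs):

```
# conf/crawl-urlfilter.txt
# accept only URLs under the topic path (placeholder domain)
+^http://([a-z0-9]*\.)*domain.com/region/topic/
# reject everything else (usual catch-all last rule)
-.
```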
