Hi, I’m sorry to bother you with these questions:
1. I can't understand why Google lists thousands (>5000) of relevant pages (searching e.g. "site:http://domain.com/region/topic/") while Nutch only finds a few hundred. I set db.max.outlinks.per.page to -1, set db.ignore.external.links to true, and commented out the skip line (#-[...@=]). I start with

bin/nutch crawl urls -dir mydir -depth 4 >& crawl.log

and get:

----------------------
TOTAL urls: 979
retry 0:    971
retry 1:    4
retry 2:    4
min score:  0.0
avg score:  0.4908437
max score:  470.039
status 1 (db_unfetched):  212
status 2 (db_fetched):    279
status 3 (db_gone):       8
status 4 (db_redir_temp): 472
status 5 (db_redir_perm): 8
CrawlDb statistics: done
----------------------

2. Most of the fetched URLs are from parent pages, e.g. http://domain.com/region/, not from http://domain.com/region/topic/, even though I set the pattern +^http://([a-z0-9]*\.)*domain.com/region/topic/ in urls.txt and crawl-urlfilter.txt.

3. What does db_redir_temp: 472 mean?

I would really appreciate any support.

Nutchnoob
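For reference, the configuration described above would look roughly like the following. This is a sketch, not the poster's exact files: the property names (db.ignore.external.links, db.max.outlinks.per.page) are the standard Nutch 1.x ones, domain.com is the placeholder from the question, and the catch-all reject line is the usual last line of a Nutch URL filter file.

```xml
<!-- conf/nutch-site.xml: the two properties mentioned above
     (standard Nutch 1.x property names) -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
```

And the URL filter (note that crawl-urlfilter.txt holds the regex rules, while the urls directory normally holds only the seed URLs):

```
# conf/crawl-urlfilter.txt
# accept only URLs under the topic path (placeholder domain)
+^http://([a-z0-9]*\.)*domain.com/region/topic/
# reject everything else (usual catch-all last rule)
-.
```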
