Hi again. To correct myself: Google doesn't actually find more URLs than Nutch, but I'm still unclear about points 2 and 3.
I would really appreciate any support.

________________________________
From: Myname To <[email protected]>
To: [email protected]
Sent: Monday, May 18, 2009, 20:05:51
Subject: Can't fetch pages from specific domain

Hi,

I'm sorry to bother you with these questions:

1. I can't understand why Google lists thousands (>5000) of relevant pages (searching e.g. "site:http://domain.com/region/topic/") while Nutch only finds a few hundred. I set db.max.outlinks.per.page to -1, db.ignore.external.links to true, and commented out the -[...@=] line. I start with

bin/nutch crawl urls -dir mydir -depth 4 >& crawl.log

and get:

----------------------
TOTAL urls:	979
retry 0:	971
retry 1:	4
retry 2:	4
min score:	0.0
avg score:	0.4908437
max score:	470.039
status 1 (db_unfetched):	212
status 2 (db_fetched):	279
status 3 (db_gone):	8
status 4 (db_redir_temp):	472
status 5 (db_redir_perm):	8
CrawlDb statistics: done
----------------------

2. Most of the fetched URLs are from the parent pages, e.g. http://domain.com/region/, not from http://domain.com/region/topic/, which is what I set (with +^http://([a-z0-9]*\.)*domain.com/region/topic/) in urls.txt and crawl-urlfilter.txt.

3. What does db_redir_perm 472 mean?

I would really appreciate any support.

Nutchnoob
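
P.S. For reference, here is roughly how the relevant parts of the setup described in point 1 would look. This is a simplified sketch, not the exact files: the property names and the accept pattern are the ones quoted above, and the other filter lines are the stock defaults from a fresh install.

In conf/nutch-site.xml:

----------------------
<configuration>
  <!-- -1 means: process all outlinks found on a page, no per-page cap -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- ignore outlinks that point to external hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
----------------------

In conf/crawl-urlfilter.txt (rules are checked top-down; the first matching rule decides accept or reject):

----------------------
# default rule that rejects URLs containing query-like characters,
# commented out so such URLs are kept
# -[?*!@=]

# accept only pages under the topic path
+^http://([a-z0-9]*\.)*domain.com/region/topic/

# reject everything else
-.
----------------------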
