Hi again. To correct myself: Google doesn't actually find more URLs than Nutch, but I'm still unclear about points 2 and 3.
I would really appreciate any support.

________________________________
From: Myname To <[email protected]>
To: [email protected]
Sent: Monday, May 18, 2009, 20:05:51
Subject: Can't fetch pages from specific domain

Hi,

I'm sorry to bother you with these questions:

1. I can't understand why Google lists thousands (>5000) of relevant pages (searching e.g. "site:http://domain.com/region/topic/") while Nutch only finds a few hundred. I set db.max.outlinks.per.page to -1, db.ignore.external.links to true, and commented out the -[...@=] line. I start with

bin/nutch crawl urls -dir mydir -depth 4 >& crawl.log

and get:

----------------------
TOTAL urls:	979
retry 0:	971
retry 1:	4
retry 2:	4
min score:	0.0
avg score:	0.4908437
max score:	470.039
status 1 (db_unfetched):	212
status 2 (db_fetched):	279
status 3 (db_gone):	8
status 4 (db_redir_temp):	472
status 5 (db_redir_perm):	8
CrawlDb statistics: done
----------------------

2. Most of the fetched URLs are from the parent pages, e.g. http://domain.com/region/, not from http://domain.com/region/topic/, which is what I set (with +^http://([a-z0-9]*\.)*domain.com/region/topic/) in urls.txt and crawl-urlfilter.txt.

3. What does db_redir_perm 472 mean?

I would really appreciate any support.

Nutchnoob
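
P.S. For reference, here is roughly how the relevant parts of the setup described in point 1 would look. This is a simplified sketch, not the exact files: the property names and the accept pattern are the ones quoted above, and the other filter lines are the stock defaults from a fresh install.

In conf/nutch-site.xml:

----------------------
<configuration>
  <!-- -1 means: process all outlinks found on a page, no per-page cap -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- ignore outlinks that point to external hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
----------------------

In conf/crawl-urlfilter.txt (rules are checked top-down; the first matching rule decides accept or reject):

----------------------
# default rule that rejects URLs containing query-like characters,
# commented out so such URLs are kept
# -[?*!@=]

# accept only pages under the topic path
+^http://([a-z0-9]*\.)*domain.com/region/topic/

# reject everything else
-.
----------------------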
