Original fetch interval: What do you mean? The script starts once a week
(only if it is not already running, of course). The fetch cycle takes 1-3 days
depending on -topN and -depth. If you mean the next fetch time attribute on
each URL, I didn't change anything; I think it is 30 days by default.
The high
Hi list,
My seed list contains URLs from about 20 different domains. In the first fetch
cycles everything is fine and URLs from all domains are selected fairly
evenly. But after about 10-15 cycles one domain starts to prevail. URLs
from all other domains will not be selected
Hi,
you could set generate.max.per.host to a reasonable size to prevent this!
In the default configuration this is set to -1, which means unlimited.
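
To illustrate the idea behind a per-host limit, here is a minimal Python sketch. It is not Nutch code and does not reproduce how the Nutch generator is implemented; the function name select_top_n, the parameter max_per_host, and the example URLs and scores are all hypothetical. It only assumes that candidates are picked in descending score order up to a top-N limit, and shows how one host with many high-scoring URLs can fill the whole selection unless a cap is applied.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_top_n(scored_urls, top_n, max_per_host=-1):
    """Pick up to top_n URLs by descending score, with an optional per-host cap.

    max_per_host=-1 means unlimited, mirroring the default described above.
    This is a simplified model, not the actual Nutch generator.
    """
    per_host = defaultdict(int)
    selected = []
    # sorted() is stable, so URLs with equal scores keep their input order.
    for url, score in sorted(scored_urls, key=lambda item: item[1], reverse=True):
        host = urlparse(url).hostname
        if max_per_host >= 0 and per_host[host] >= max_per_host:
            continue  # this host has already used up its quota
        per_host[host] += 1
        selected.append(url)
        if len(selected) == top_n:
            break
    return selected


# Hypothetical candidates: one host has many high-scoring URLs.
candidates = [(f"http://big.example.com/page{i}", 1.0) for i in range(10)]
candidates += [(f"http://small{i}.example.org/", 0.5) for i in range(3)]

unlimited = select_top_n(candidates, top_n=5)
capped = select_top_n(candidates, top_n=5, max_per_host=2)
```

With these made-up inputs, the unlimited selection consists only of big.example.com URLs, while the capped selection takes two of them and fills the remaining slots from the other hosts.
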
BR
Hannes
---
Hannes Carl Meyer
www.informera.de
On Fri, Jul 8, 2011 at 2:53 PM, Eggebrecht, Thomas (GfK Marktforschung)
thomas.eggebre...@gfk.com
Yes, this would limit the number of URLs from any one domain, but it would
not explain why one domain seems to get fetched more after recursive fetches
of a given seed set.
Can you explain more about your crawling operation? Are you executing a
crawl command? If so, what arguments are you