Re: Partitioning selected urls for politeness and scoring

2011-07-11 Thread Thomas Eggebrecht
Original fetch interval: What do you mean? The script starts once a week (of course only if it is not already running). The fetch cycle takes 1-3 days, depending on -topN and -depth. If you mean the next fetch time attribute on each URL, I didn't change anything - I think it is 30 days by default. The high

Partitioning selected urls for politeness and scoring

2011-07-08 Thread Eggebrecht, Thomas (GfK Marktforschung)
Hi list, My seed list contains URLs from about 20 different domains. In the first fetch cycles everything is fine and URLs from all domains are selected fairly evenly. But after about 10-15 cycles one domain starts to prevail. URLs from all other domains will not be selected

Re: Partitioning selected urls for politeness and scoring

2011-07-08 Thread Hannes Carl Meyer
Hi, you could set generate.max.per.host to a reasonable size to prevent this! In the default configuration it is set to -1, which means unlimited. BR Hannes --- Hannes Carl Meyer www.informera.de On Fri, Jul 8, 2011 at 2:53 PM, Eggebrecht, Thomas (GfK Marktforschung) thomas.eggebre...@gfk.com
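
To make the effect of such a per-host cap concrete, here is a minimal Python sketch of score-ordered selection with a limit per host. It is only an illustration under stated assumptions, not Nutch's generator code: the function select_top_n, its parameters, and the example URLs and scores are all hypothetical. The only thing taken from the message above is the convention that -1 means unlimited.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_top_n(scored_urls, top_n, max_per_host=-1):
    """Pick up to top_n URLs by descending score, capping each host.

    scored_urls is an iterable of (url, score) pairs (hypothetical input).
    max_per_host = -1 disables the cap, mirroring the "unlimited" default
    mentioned in the message above.
    """
    per_host = defaultdict(int)  # number of URLs already taken per host
    selected = []
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    for url, _score in ranked:
        if len(selected) >= top_n:
            break
        host = urlparse(url).hostname
        # Skip this URL if its host has already used up its share.
        if max_per_host >= 0 and per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        selected.append(url)
    return selected


# Hypothetical candidates: one host holds all the high-scoring URLs.
candidates = [
    ("http://a.example/1", 0.9),
    ("http://a.example/2", 0.8),
    ("http://a.example/3", 0.7),
    ("http://b.example/1", 0.2),
]

uncapped = select_top_n(candidates, top_n=3)
capped = select_top_n(candidates, top_n=3, max_per_host=2)
```

Without a cap, the three slots all go to the highest-scoring host. With a cap of 2, the third slot falls through to the lower-scoring host, which is the kind of balancing the suggestion above is aiming for.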

Re: Partitioning selected urls for politeness and scoring

2011-07-08 Thread lewis john mcgibbney
Yes, this would limit the number of URLs from any one domain, but it would not explain why one domain seems to get fetched more after recursive fetches of some given seed set. Can you explain more about your crawling operation? Are you executing a crawl command? If so what arguments are you