Using an adaptive setting is a pretty daunting task. Perhaps a nice start would be creating a mechanism that allows exceptional queue settings to be set *by hand*? A resource file would fit the purpose. Later on it could be replaced by automatic settings.
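
To make the idea concrete, here is a minimal sketch of such a hand-maintained resource file and its lookup. It is Python for brevity, not Nutch code. The file format (one "queue id, max count" pair per line), the function names and the example hosts are all assumptions made for illustration.

```python
# Hypothetical override file: one "<queue id> <max count>" pair per line.
# The queue id would be a host, domain or IP, depending on the count mode.
SAMPLE = """\
# queue id        max count
bignews.example   100
tiny.example      5
"""


def load_overrides(text):
    """Parse the override file; '#' starts a comment, blank lines are skipped."""
    overrides = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        queue_id, count = line.split()
        overrides[queue_id.lower()] = int(count)
    return overrides


def max_count_for(queue_id, overrides, default_max):
    """Return the hand-set limit for a queue, or the global default."""
    return overrides.get(queue_id.lower(), default_max)
```

Queues that are not listed simply keep the global generate.max.count value, so the file only has to name the exceptions.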

On 11/04/2011 01:56 PM, Markus Jelsma wrote:

On Friday 04 November 2011 13:39:25 Ferdy Galema wrote:
Hi Markus,

I was wondering what exactly you mean by dynamic. Is it different per
fetch cycle but the same for all queues, or do you mean a different value
for different queues? (For example, when the type is HOST, hostA will have
a different generate max count than hostB.)
Yes. I would like to generate more records for domains/hosts with a large
number of URLs, such as big news sites. For small websites we would want to
reduce the number of generated records.

The rationale behind this is that politeness varies between small, medium and
large sites. We can easily fetch 100 URLs for the big news site but not for a
small site.
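
One way to express this, as a rough sketch only: derive the per-queue limit from the total number of known records for that queue and clamp it to a polite range. The function name, the divisor and the bounds below are assumptions chosen for illustration, not existing Nutch settings.

```python
def dynamic_max_count(total_records, records_per_slot=100, floor=5, ceiling=100):
    """Scale the generate limit with queue size, clamped to [floor, ceiling].

    total_records    -- number of known records for this host/domain/IP
    records_per_slot -- hypothetical divisor: one generated record per this many known
    floor, ceiling   -- hypothetical politeness bounds for small and large sites
    """
    scaled = total_records // records_per_slot
    return max(floor, min(ceiling, scaled))
```

With these example values a site with 200 known records would get the floor of 5, a site with 5,000 records would get 50, and a site with a million records would be capped at 100.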

Cheers


Ferdy.

On 11/04/2011 12:32 AM, Markus Jelsma wrote:
Hi,

The generate.max.count defines the number of records per type of queue.
We're looking for an improvement to make this setting dynamic. The new
variable would be the number of total records for that type of queue
(ip, host, domain).

How can we adapt the generator for this? The problem is that there's no
information on the number of records for a given URL.

Any thoughts? Could we perhaps modify the updater to count the number of
records for a queue and write it to the CrawlDatum without building a new
updater tool based on the information provided by the current
domainstatistics tool?
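
To illustrate the kind of count meant here, a minimal sketch that tallies records per host from a plain list of URLs. It is Python and works on an in-memory list, so it says nothing about how the updater or the CrawlDatum would actually carry the number; the function name is hypothetical, and domain or IP queues would need their own key extraction.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_per_host(urls):
    """Count how many URLs fall into each host queue."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # lower-cased, None if the URL has no host
        if host:
            counts[host] += 1
    return counts
```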

Thanks
