Hi Andrzej,

1. Even with a pretty broad area of interest, you wind up focusing on a subset of all domains, which means that the max-threads-per-host limit (for polite crawling) starts killing your efficiency.

The "policies" approach that I described is able to follow and distribute the scores along any links, not necessarily within the domain, so I think we could avoid this.

The issue here is that we score pages statically - so a page doesn't start with a score of 1.0f, but rather with a score in the range 0.0f...1.0f.

We then divide this score among the (valid) outlinks, and sum these contributions into the OPIC scores of the referenced pages.
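To make that concrete, here's a minimal sketch of the split-and-sum step, assuming an even split across outlinks - the URLs and scores are made up, and this isn't the actual Nutch code:

    import java.util.*;

    public class OpicSketch {
        public static void main(String[] args) {
            // Static scores in 0.0f...1.0f, assigned per page (illustrative values).
            Map<String, Float> staticScore = new HashMap<>();
            staticScore.put("http://a.com/", 0.8f);
            staticScore.put("http://b.com/", 0.3f);

            // Valid outlinks per page (illustrative).
            Map<String, List<String>> outlinks = new HashMap<>();
            outlinks.put("http://a.com/", Arrays.asList("http://b.com/", "http://c.com/"));
            outlinks.put("http://b.com/", Arrays.asList("http://c.com/"));

            // Each page's static score is split evenly among its outlinks,
            // and the shares are summed into the OPIC score of each target.
            Map<String, Float> opicScore = new HashMap<>();
            for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
                List<String> targets = e.getValue();
                float share = staticScore.getOrDefault(e.getKey(), 0.0f) / targets.size();
                for (String target : targets) {
                    opicScore.merge(target, share, Float::sum);
                }
            }
            System.out.println(opicScore); // c.com gets 0.4 + 0.3, b.com gets 0.4
        }
    }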

When we use these scores to rank the URLs to fetch, and we constrain (via topN) the number of URLs fetched each round to help focus the search, we wind up with what looks like an exponential curve for URLs per domain - the top few domains wind up with most of the URLs.

We modified Nutch 0.7 to restrict the number of URLs coming from any one domain, as a percentage of the total URLs being fetched. I see that 0.8 has something similar, but it's a max-URLs-per-domain, whereas I think a percentage makes more sense - see the sketch below.
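Roughly, the selection looks like this (a sketch, not our actual patch - the 40% cap, the URL list, and the method names are illustrative):

    import java.net.URI;
    import java.util.*;

    public class DomainCapSketch {
        // Pick up to topN URLs, never taking more than maxFraction of the
        // total from any single domain. Assumes urlsByScore is already
        // sorted by descending score.
        static List<String> select(List<String> urlsByScore, int topN, float maxFraction) {
            int perDomainCap = Math.max(1, (int) (topN * maxFraction));
            Map<String, Integer> perDomain = new HashMap<>();
            List<String> selected = new ArrayList<>();
            for (String url : urlsByScore) {
                if (selected.size() >= topN) break;
                String domain = URI.create(url).getHost();
                int count = perDomain.getOrDefault(domain, 0);
                if (count < perDomainCap) {
                    perDomain.put(domain, count + 1);
                    selected.add(url);
                }
            }
            return selected;
        }

        public static void main(String[] args) {
            List<String> urls = Arrays.asList(
                "http://big.com/1", "http://big.com/2", "http://big.com/3",
                "http://small.org/1", "http://other.net/1");
            // At most 40% of the fetch list from any one domain,
            // so big.com/3 gets skipped here.
            System.out.println(select(urls, 5, 0.4f));
        }
    }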

We also had to add "kill the fetch" support. We monitor the fetch threads, and when the ratio of active (fetching) threads to inactive (blocked) threads drops below a threshold, we terminate the fetch. This compensates for cases where a popular site is also a low-performing site.
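A minimal sketch of that kind of watchdog, assuming the fetcher threads maintain the two counters themselves - the 0.25f threshold is just an example, not what we shipped:

    import java.util.concurrent.atomic.AtomicInteger;

    public class FetchWatchdog {
        // Counters updated by the fetcher threads: a thread increments
        // "active" while fetching and "blocked" while waiting politely.
        static final AtomicInteger active = new AtomicInteger();
        static final AtomicInteger blocked = new AtomicInteger();

        // Returns true if the fetch should be terminated: too few threads
        // are doing useful work relative to those stuck waiting.
        static boolean shouldKillFetch(float minRatio) {
            int b = blocked.get();
            if (b == 0) return false; // nothing blocked, all is well
            return ((float) active.get() / b) < minRatio;
        }

        public static void main(String[] args) {
            active.set(2);
            blocked.set(18);
            // Kill the fetch when the active:blocked ratio drops below 0.25.
            System.out.println(shouldKillFetch(0.25f)); // true: 2/18 = 0.11
        }
    }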

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
