Hi Andrzej,

1. Even with a pretty broad area of interest, you wind up focusing on a subset of all domains, which means that the max-threads-per-host limit (for polite crawling) starts killing your efficiency.

The "policies" approach that I described is able to follow and distribute the scores along any links, not necessarily within the domain, so I think we could avoid this.

The issue here is that we score pages statically - so a page doesn't start with a score of 1.0f, but rather with a score in the range 0.0f...1.0f.

We then divide this score among the (valid) outlinks, and sum these contributions into the OPIC scores of the referenced pages.
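To make that concrete, here's a minimal sketch of the split-and-sum step, assuming an even split across outlinks - the URLs and scores are made up, and this isn't the actual Nutch code:

    import java.util.*;

    public class OpicSketch {
        public static void main(String[] args) {
            // Static scores in 0.0f...1.0f, assigned per page (illustrative values).
            Map<String, Float> staticScore = new HashMap<>();
            staticScore.put("http://a.com/", 0.8f);
            staticScore.put("http://b.com/", 0.3f);

            // Valid outlinks per page (illustrative).
            Map<String, List<String>> outlinks = new HashMap<>();
            outlinks.put("http://a.com/", Arrays.asList("http://b.com/", "http://c.com/"));
            outlinks.put("http://b.com/", Arrays.asList("http://c.com/"));

            // Each page's static score is split evenly among its outlinks,
            // and the shares are summed into the OPIC score of each target.
            Map<String, Float> opicScore = new HashMap<>();
            for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
                List<String> targets = e.getValue();
                float share = staticScore.getOrDefault(e.getKey(), 0.0f) / targets.size();
                for (String target : targets) {
                    opicScore.merge(target, share, Float::sum);
                }
            }
            System.out.println(opicScore); // c.com gets 0.4 + 0.3, b.com gets 0.4
        }
    }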

When we use these scores to rank the URLs to fetch, and we constrain (via topN) the number of URLs fetched each round to help focus the search, we wind up with what looks like an exponential curve for URLs per domain - the top few domains wind up with most of the URLs.

We modified Nutch 0.7 to restrict the number of URLs coming from any one domain, as a percentage of the total URLs being fetched. I see that 0.8 has something similar, but it's a max-URLs-per-domain, whereas I think a percentage makes more sense - see the sketch below.
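Roughly, the selection looks like this (a sketch, not our actual patch - the 40% cap, the URL list, and the method names are illustrative):

    import java.net.URI;
    import java.util.*;

    public class DomainCapSketch {
        // Pick up to topN URLs, never taking more than maxFraction of the
        // total from any single domain. Assumes urlsByScore is already
        // sorted by descending score.
        static List<String> select(List<String> urlsByScore, int topN, float maxFraction) {
            int perDomainCap = Math.max(1, (int) (topN * maxFraction));
            Map<String, Integer> perDomain = new HashMap<>();
            List<String> selected = new ArrayList<>();
            for (String url : urlsByScore) {
                if (selected.size() >= topN) break;
                String domain = URI.create(url).getHost();
                int count = perDomain.getOrDefault(domain, 0);
                if (count < perDomainCap) {
                    perDomain.put(domain, count + 1);
                    selected.add(url);
                }
            }
            return selected;
        }

        public static void main(String[] args) {
            List<String> urls = Arrays.asList(
                "http://big.com/1", "http://big.com/2", "http://big.com/3",
                "http://small.org/1", "http://other.net/1");
            // At most 40% of the fetch list from any one domain,
            // so big.com/3 gets skipped here.
            System.out.println(select(urls, 5, 0.4f));
        }
    }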

We also had to add "kill the fetch" support. We monitor the fetch threads, and when the ratio of active (fetching) threads to inactive (blocked) threads drops below a threshold, we terminate the fetch. This compensates for cases where a popular site is also a low-performing site.
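A minimal sketch of that kind of watchdog, assuming the fetcher threads maintain the two counters themselves - the 0.25f threshold is just an example, not what we shipped:

    import java.util.concurrent.atomic.AtomicInteger;

    public class FetchWatchdog {
        // Counters updated by the fetcher threads: a thread increments
        // "active" while fetching and "blocked" while waiting politely.
        static final AtomicInteger active = new AtomicInteger();
        static final AtomicInteger blocked = new AtomicInteger();

        // Returns true if the fetch should be terminated: too few threads
        // are doing useful work relative to those stuck waiting.
        static boolean shouldKillFetch(float minRatio) {
            int b = blocked.get();
            if (b == 0) return false; // nothing blocked, all is well
            return ((float) active.get() / b) < minRatio;
        }

        public static void main(String[] args) {
            active.set(2);
            blocked.set(18);
            // Kill the fetch when the active:blocked ratio drops below 0.25.
            System.out.println(shouldKillFetch(0.25f)); // true: 2/18 = 0.11
        }
    }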

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
