Hi Ken,

First of all, thanks for sharing your insights; that's a very interesting read.

Ken Krugler wrote:
This sounds like the TrustRank algorithm. See http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust attenuation via trust dampening (reducing the trust level as you get further from a trusted page) and trust splitting (OPIC-like approach).

Yes, I'm familiar with that paper; that's what got me thinking ... ;)

> 1. Even with a pretty broad area of interest, you wind up focusing on a subset of all domains, which then means that the max-threads-per-host limit (for polite crawling) starts killing your efficiency.

The "policies" approach that I described can follow and distribute scores along any links, not just those within the same domain, so I think we could avoid this problem.
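
Just to illustrate (the class and method names below are made up for this example, they are not the actual plugin API), such a policy could dampen the parent's score per hop and split it OPIC-style across all outlinks:

public class DampenAndSplitPolicy {

  // Attenuation factor per hop, e.g. 0.85f: pages further away from the
  // trusted seeds receive exponentially less trust.
  private final float dampening;

  public DampenAndSplitPolicy(float dampening) {
    this.dampening = dampening;
  }

  // Score contribution each outlink receives from this page. The split
  // applies to ALL outlinks, not just same-domain ones, so the score can
  // flow across hosts and the crawl isn't confined to a few domains.
  public float scorePerOutlink(float parentScore, int outlinkCount) {
    if (outlinkCount <= 0) return 0.0f;
    return (parentScore * dampening) / outlinkCount;
  }
}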

> The next step is to turn on HTTP keep-alive, which would be pretty easy except for some funky issues we've run into with the HTTP connection pool, and the fact that the current protocol plug-in API doesn't give us a good channel to pass back the information that the fetch thread needs to control the keep-alive process.

I'm afraid this is still the case in 0.8; given the way protocol plugins interact with the Fetcher and the fetchlist, it's not easy to improve.
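
One way I could imagine opening such a channel (purely a sketch, this class is hypothetical and not part of the current API) is to let the protocol plugin hand the raw response headers back, so the fetcher thread can decide whether the connection may be reused:

import java.util.HashMap;
import java.util.Map;

// Hypothetical fetch result, not the current ProtocolOutput: it carries
// the response headers back to the fetcher thread, which can then
// control the keep-alive process itself.
public class FetchResult {

  private final byte[] content;
  private final Map<String, String> headers;

  public FetchResult(byte[] content, Map<String, String> responseHeaders) {
    this.content = content;
    this.headers = new HashMap<String, String>(responseHeaders);
  }

  public byte[] getContent() {
    return content;
  }

  // HTTP/1.1 defaults to persistent connections unless the server
  // explicitly sends "Connection: close".
  public boolean keepAlive() {
    String c = headers.get("Connection");
    return c == null || !c.equalsIgnoreCase("close");
  }
}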


> 2. With a vertical crawl, you seem to wind up at "Big Bob's Server" sooner/more often than with a breadth-first crawl. Going wide means you spend a lot more time on sites with lots of pages, and these are typically higher-performance/better-behaved. With vertical, we seem to hit a lot more slow/nonconforming servers, which then kill our fetch performance.

> Typical issues are things like servers sending back an endless stream of HTTP response headers, or trickling data back for a big file.

> To work around this, we've implemented support for monitoring thread performance and interrupting threads that are taking too long.

I agree; this is something that needs to be added, especially to the HTTP protocol plugins. The code for the HTTP plugins has been refactored, so I hope it will be easier to implement this monitoring in 0.8 ...
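
Something along these lines, perhaps (a minimal sketch, these are not the actual Fetcher classes; note that the fetch itself must also be interruptible, e.g. via socket timeouts):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Watchdog sketch: give each fetch a hard time budget and cancel it when
// the budget is exceeded, e.g. a server trickling back an endless stream
// of response headers.
public class FetchWatchdog {

  private final ExecutorService pool = Executors.newFixedThreadPool(10);

  public byte[] fetchWithDeadline(final String url, long maxMillis) {
    Future<byte[]> task = pool.submit(new Callable<byte[]>() {
      public byte[] call() throws Exception {
        return fetch(url); // the actual protocol-plugin fetch
      }
    });
    try {
      return task.get(maxMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Interrupt the worker; blocking socket reads won't see the
      // interrupt, so the fetch code must also set SO_TIMEOUT.
      task.cancel(true);
      return null; // caller marks the URL as failed / retry later
    } catch (Exception e) {
      return null;
    }
  }

  private byte[] fetch(String url) {
    return new byte[0]; // placeholder for the real fetch
  }
}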


> 3. With a vertical crawl, you typically want to keep the number of fetched URLs (per loop) at a pretty low percentage relative to the total number of unfetched URLs in the WebDB. This helps with maintaining good crawl focus.

> The problem we've run into is that the percentage of time spent updating the WebDB, once you're past about 10M pages, starts to dominate the total crawl time.

I know what you mean; I went through it with a 20-million-page installation, and it wasn't pleasant. This is definitely not the case with the 0.8 CrawlDB (which replaces the WebDB). All update times appear to be linear in the number of entries, and much shorter overall. So even with a large DB and large updates, the updatedb command is pretty fast.

> 4. We'd like to save multiple scores per page, for the case where we're applying multiple scoring functions. But modifying the WebDB to support a set of scores per page, while not hurting performance, seems tricky.

Again, this will be different in 0.8: we are going to add support for arbitrary metadata to CrawlDB entries (formerly known as Page, now CrawlDatum). You will be able to add multiple scores just fine, at a relatively small cost in performance for the generate/update operations. Some performance loss is unavoidable, since adding more data always means more data to process ... but overall these operations scale much better in 0.8 than before.
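
For example, something like this should become possible (just a sketch: the metadata container here is a plain Map, and the final CrawlDatum metadata API may look different):

import java.util.HashMap;
import java.util.Map;

// Sketch only: several named scores kept as per-entry metadata.
public class MultiScoreExample {

  public static void main(String[] args) {
    Map<String, Float> metaData = new HashMap<String, Float>();

    // One entry per scoring function applied to the page.
    metaData.put("score.trustrank", 0.42f);
    metaData.put("score.topical", 0.77f);

    // The generator could then sort fetchlists by any single score,
    // or by a weighted combination:
    float combined = 0.7f * metaData.get("score.trustrank")
                   + 0.3f * metaData.get("score.topical");
    System.out.println("combined = " + combined);
  }
}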

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



