Hi Ken,

First of all, thanks for sharing your insights; that's a very interesting read.

Ken Krugler wrote:
This sounds like the TrustRank algorithm. See http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust attenuation via trust dampening (reducing the trust level as you get further from a trusted page) and trust splitting (OPIC-like approach).

Yes, I'm familiar with that paper; that's what got me thinking ... ;)

> 1. Even with a pretty broad area of interest, you wind up focusing on a subset of all domains, which then means that the max-threads-per-host limit (for polite crawling) starts killing your efficiency.

The "policies" approach that I described can follow and distribute scores along any links, not just those within the same domain, so I think we could avoid this problem.
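
Just to illustrate (the class and method names below are made up for this example, they are not the actual plugin API), such a policy could dampen the parent's score per hop and split it OPIC-style across all outlinks:

public class DampenAndSplitPolicy {

  // Attenuation factor per hop, e.g. 0.85f: pages further away from the
  // trusted seeds receive exponentially less trust.
  private final float dampening;

  public DampenAndSplitPolicy(float dampening) {
    this.dampening = dampening;
  }

  // Score contribution each outlink receives from this page. The split
  // applies to ALL outlinks, not just same-domain ones, so the score can
  // flow across hosts and the crawl isn't confined to a few domains.
  public float scorePerOutlink(float parentScore, int outlinkCount) {
    if (outlinkCount <= 0) return 0.0f;
    return (parentScore * dampening) / outlinkCount;
  }
}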

> The next step is to turn on HTTP keep-alive, which would be pretty easy except for some funky issues we've run into with the HTTP connection pool, and the fact that the current protocol plug-in API doesn't give us a good channel to pass back the information that the fetch thread needs to control the keep-alive process.

I'm afraid this is still the case in 0.8; given the way protocol plugins interact with the Fetcher and the fetchlist, it's not easy to improve.
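
One way I could imagine opening such a channel (purely a sketch, this class is hypothetical and not part of the current API) is to let the protocol plugin hand the raw response headers back, so the fetcher thread can decide whether the connection may be reused:

import java.util.HashMap;
import java.util.Map;

// Hypothetical fetch result, not the current ProtocolOutput: it carries
// the response headers back to the fetcher thread, which can then
// control the keep-alive process itself.
public class FetchResult {

  private final byte[] content;
  private final Map<String, String> headers;

  public FetchResult(byte[] content, Map<String, String> responseHeaders) {
    this.content = content;
    this.headers = new HashMap<String, String>(responseHeaders);
  }

  public byte[] getContent() {
    return content;
  }

  // HTTP/1.1 defaults to persistent connections unless the server
  // explicitly sends "Connection: close".
  public boolean keepAlive() {
    String c = headers.get("Connection");
    return c == null || !c.equalsIgnoreCase("close");
  }
}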


> 2. With a vertical crawl, you seem to wind up at "Big Bob's Server" sooner/more often than with a breadth-first crawl. Going wide means you spend a lot more time on sites with lots of pages, and these are typically higher-performance/better-behaved. With vertical, we seem to hit a lot more slow/nonconforming servers, which then kill our fetch performance.

> Typical issues are things like servers sending back an endless stream of HTTP response headers, or trickling data back for a big file.

> To work around this, we've implemented support for monitoring thread performance and interrupting threads that are taking too long.

I agree; this is something that needs to be added, especially to the HTTP protocol plugins. The code for the HTTP plugins has been refactored, so I hope it will be easier to implement this monitoring in 0.8 ...
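
Something along these lines, perhaps (a minimal sketch, these are not the actual Fetcher classes; note that the fetch itself must also be interruptible, e.g. via socket timeouts):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Watchdog sketch: give each fetch a hard time budget and cancel it when
// the budget is exceeded, e.g. a server trickling back an endless stream
// of response headers.
public class FetchWatchdog {

  private final ExecutorService pool = Executors.newFixedThreadPool(10);

  public byte[] fetchWithDeadline(final String url, long maxMillis) {
    Future<byte[]> task = pool.submit(new Callable<byte[]>() {
      public byte[] call() throws Exception {
        return fetch(url); // the actual protocol-plugin fetch
      }
    });
    try {
      return task.get(maxMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Interrupt the worker; blocking socket reads won't see the
      // interrupt, so the fetch code must also set SO_TIMEOUT.
      task.cancel(true);
      return null; // caller marks the URL as failed / retry later
    } catch (Exception e) {
      return null;
    }
  }

  private byte[] fetch(String url) {
    return new byte[0]; // placeholder for the real fetch
  }
}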


> 3. With a vertical crawl, you typically want to keep the number of fetched URLs (per loop) at a pretty low percentage relative to the total number of unfetched URLs in the WebDB. This helps with maintaining good crawl focus.

> The problem we've run into is that the percentage of time spent updating the WebDB, once you're past about 10M pages, starts to dominate the total crawl time.

I know what you mean; I went through it with a 20-million-page installation, and it wasn't pleasant. This is definitely not the case with the 0.8 CrawlDB (which replaces the WebDB). All update times appear to be linear in the number of entries, and much shorter overall. So even with a large DB and large updates, the updatedb command is pretty fast.

> 4. We'd like to save multiple scores per page, for the case where we're applying multiple scoring functions. But modifying the WebDB to support a set of scores per page, while not hurting performance, seems tricky.

Again, this will be different in 0.8: we are going to add support for arbitrary metadata to CrawlDB entries (formerly known as Page, now CrawlDatum). You will be able to add multiple scores just fine, at a relatively small cost in performance for the generate/update operations. Some performance loss is unavoidable, since adding more data always means more data to process ... but overall these operations scale much better in 0.8 than before.
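
For example, something like this should become possible (just a sketch: the metadata container here is a plain Map, and the final CrawlDatum metadata API may look different):

import java.util.HashMap;
import java.util.Map;

// Sketch only: several named scores kept as per-entry metadata.
public class MultiScoreExample {

  public static void main(String[] args) {
    Map<String, Float> metaData = new HashMap<String, Float>();

    // One entry per scoring function applied to the page.
    metaData.put("score.trustrank", 0.42f);
    metaData.put("score.topical", 0.77f);

    // The generator could then sort fetchlists by any single score,
    // or by a weighted combination:
    float combined = 0.7f * metaData.get("score.trustrank")
                   + 0.3f * metaData.get("score.topical");
    System.out.println("combined = " + combined);
  }
}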

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



