Hi Ken,
First of all, thanks for sharing your insights; that's a very
interesting read.
Ken Krugler wrote:
> This sounds like the TrustRank algorithm. See
> http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust
> attenuation via trust dampening (reducing the trust level as you get
> further from a trusted page) and trust splitting (OPIC-like approach).
Yes, I'm familiar with that paper; that's what got me thinking ... ;)
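For those who haven't read it, the two attenuation schemes boil down
to something like this (my own shorthand in Java, not the paper's
notation):

  class TrustMath {
    // trust dampening: each hop away from a seed multiplies trust by a
    // constant d in (0,1), e.g. 0.85
    static float dampen(float parentTrust, float d) {
      return d * parentTrust;
    }

    // trust splitting: the parent divides its trust evenly among its
    // outlinks (the OPIC-like variant)
    static float split(float parentTrust, int outDegree) {
      return (outDegree > 0) ? parentTrust / outDegree : 0.0f;
    }
  }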
> 1. Even with a pretty broad area of interest, you wind up focusing on
> a subset of all domains, which then means that the max-threads-per-host
> limit (for polite crawling) starts killing your efficiency.
The "policies" approach that I described is able to follow and
distribute the scores along any links, not necessarily within the
domain, so I think we could avoid this.
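To make the idea concrete, here's a rough sketch of what such a
"policy" hook could look like (all names here are hypothetical, this
is not actual Nutch code), combining the two rules above and ignoring
domain boundaries entirely:

  // A pluggable policy decides how much score flows across each link.
  interface ScorePolicy {
    float contribution(String fromUrl, String toUrl,
                       float fromScore, int outDegree);
  }

  // Split the parent's score among its outlinks, then dampen it by a
  // constant factor; note that the domain of toUrl plays no role here.
  class DampenedSplitPolicy implements ScorePolicy {
    private final float d; // dampening factor, e.g. 0.85f
    DampenedSplitPolicy(float d) { this.d = d; }
    public float contribution(String fromUrl, String toUrl,
                              float fromScore, int outDegree) {
      return (outDegree > 0) ? d * (fromScore / outDegree) : 0.0f;
    }
  }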
> The next step is to turn on HTTP keep-alive, which would be pretty
> easy other than some funky issues we've run into with the HTTP
> connection pool, and the fact that the current protocol plug-in API
> doesn't give us a good channel to pass back the info that the fetch
> thread needs to control the keep-alive process.
I'm afraid this is still the case in 0.8; given the way protocol
plugins interact with the Fetcher and the fetchlist, it's not easy to
improve.
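For what it's worth, on the connection-pool side something like
Jakarta Commons HttpClient 3.x already gives you pooled, kept-alive
connections; a minimal sketch (the pool sizes are made up, and this is
not the actual protocol-httpclient code):

  import java.io.IOException;
  import org.apache.commons.httpclient.HttpClient;
  import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
  import org.apache.commons.httpclient.methods.GetMethod;

  public class KeepAliveFetch {
    public static void main(String[] args) throws IOException {
      MultiThreadedHttpConnectionManager mgr =
          new MultiThreadedHttpConnectionManager();
      mgr.getParams().setMaxTotalConnections(100);
      mgr.getParams().setDefaultMaxConnectionsPerHost(2); // stay polite

      HttpClient client = new HttpClient(mgr);
      GetMethod get = new GetMethod("http://example.com/"); // placeholder
      try {
        client.executeMethod(get);
        byte[] content = get.getResponseBody();
        System.out.println("fetched " + content.length + " bytes");
      } finally {
        // return the connection to the pool so it can be reused (keep-alive)
        get.releaseConnection();
      }
    }
  }

The hard part, as you say, is not the pool itself but feeding the
keep-alive state back through the protocol plugin API to the fetcher
threads.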
> 2. With a vertical crawl, you seem to wind up at "Big Bob's Server"
> sooner/more often than with a breadth-first crawl. Going wide means
> you spend a lot more time on sites with lots of pages, and these are
> typically higher-performance/better-behaved. With vertical, we seem to
> hit a lot more slow/nonconforming servers, which then kill our fetch
> performance.
> Typical issues are things like servers sending back an endless stream
> of HTTP response headers, or trickling data back for a big file.
> To work around this, we've implemented support for monitoring thread
> performance and interrupting threads that are taking too long.
I agree; this is something that needs to be added, especially to the
HTTP protocol plugins. The code for the HTTP plugins has been
refactored, so I hope it will be easier to implement this monitoring
in 0.8 ...
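Something along these lines, perhaps (a hypothetical sketch, not the
refactored Nutch code): each fetcher thread registers when it starts a
fetch, and a daemon watchdog interrupts any fetch that exceeds a hard
per-page limit.

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  /** Interrupts fetches that exceed a hard per-page time limit. */
  public class FetchWatchdog extends Thread {
    private final Map<Thread, Long> started =
        new ConcurrentHashMap<Thread, Long>();
    private final long maxFetchMillis;

    public FetchWatchdog(long maxFetchMillis) {
      this.maxFetchMillis = maxFetchMillis;
      setDaemon(true);
    }

    // fetcher threads call these around every fetch
    public void begin() {
      started.put(Thread.currentThread(), System.currentTimeMillis());
    }
    public void finish() {
      started.remove(Thread.currentThread());
    }

    public void run() {
      while (true) {
        long now = System.currentTimeMillis();
        for (Map.Entry<Thread, Long> e : started.entrySet()) {
          if (now - e.getValue() > maxFetchMillis) {
            // the fetch loop must treat the interrupt as "abort this page";
            // note: a thread blocked in a plain socket read won't see the
            // interrupt, so socket timeouts are still needed as well
            e.getKey().interrupt();
            started.remove(e.getKey());
          }
        }
        try { Thread.sleep(1000); } catch (InterruptedException ie) { return; }
      }
    }
  }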
> 3. With a vertical crawl, you typically want to keep the number of
> fetched URLs (per loop) at a pretty low percentage relative to the
> total number of unfetched URLs in the WebDB. This helps with
> maintaining good crawl focus.
> The problem we've run into is that the percentage of time spent
> updating the WebDB, once you're past about 10M pages, starts to
> dominate the total crawl time.
I know what you mean; I went through it with a 20 million page
installation and it wasn't pleasant. This is definitely not the case
with the 0.8 CrawlDB (which replaces the WebDB): all update times
appear to be linearly proportional to the number of entries, and much
shorter overall. So even with a large DB and large updates the
updatedb command is pretty fast.
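As for keeping the per-loop percentage low, in 0.8 you would control
this through the generate step's -topN parameter; sizing it as a
fraction of the frontier is simple arithmetic (numbers below are made
up, just to show the idea):

  long unfetched = 50000000L; // unfetched URLs in the CrawlDB, e.g. from a stats dump
  double fraction = 0.05;     // fetch at most 5% of the frontier per round
  long topN = Math.max(1L, (long) (unfetched * fraction));
  // => 2500000; then: bin/nutch generate <crawldb> <segments> -topN 2500000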
> 4. We'd like to save multiple scores for a page, for the case where
> we're applying multiple scoring functions to a page. But modifying the
> WebDB to support a set of scores per page, while not hurting
> performance, seems tricky.
Again, this will be different in 0.8: we are going to add support for
arbitrary metadata on CrawlDB entries (the class formerly known as
Page, now CrawlDatum). You will be able to add multiple scores just
fine, at a relatively small cost in performance for generate/update
operations; at least some performance loss is unavoidable, since
adding more data always means more data to process. But overall these
operations scale much better in 0.8 than before.
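Just to illustrate what that could look like from the user's point of
view (the exact 0.8 API is still in flux, so treat all names here as
made up):

  import java.util.HashMap;
  import java.util.Map;

  // Stands in for the 0.8 CrawlDatum; only the metadata idea matters here.
  class DatumSketch {
    private float score = 1.0f; // the "main" score that generate sorts on
    private final Map<String, Float> meta = new HashMap<String, Float>();

    void setScore(String name, float value) { meta.put(name, value); }
    float getScore(String name) {
      Float v = meta.get(name);
      return (v == null) ? 0.0f : v.floatValue();
    }
  }

  // usage:
  //   DatumSketch d = new DatumSketch();
  //   d.setScore("trustrank", 0.7f);
  //   d.setScore("topic-relevance", 0.3f);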
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems? Stop! Download the new AJAX search engine that makes
searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers