I do understand that CrawlDb is the source to get all known URLs
from, and from those URLs we can extract host names, domains, etc.
(as the DomainStatistics tool does), but I don't understand how you'd
use CrawlDb as the source of per-host data, since CrawlDb does not
have aggregate per-host data.  Shouldn't that live in a separate
file, a file that can be updated after every fetch run?

Well, as it happens, map-reduce is exceptionally good at collecting
aggregate data :) This is a simple map-reduce job, where we do the
following:

Map:    input: <url, CrawlDatum> from CrawlDb
        output: <host, hostStats>

Host is extracted from the current url, and hostStats is extracted from
the data in this CrawlDatum.

Reduce: input: <host, (hostStats1, hostStats2, ...)>
        output: <host, hostStats> // aggregated
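
To make the data flow concrete, here is a rough sketch of that job in plain
Python. It only simulates the map, group and reduce steps in memory; it is
not Nutch or Hadoop code. The (url, status) tuples standing in for CrawlDb
entries, the "urls" and "fetched" counters, and the function names are all
assumptions made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical stand-in for CrawlDb entries: (url, status) pairs.
# A real CrawlDatum carries more fields; only a status is assumed here.
CRAWLDB = [
    ("http://example.com/a", "fetched"),
    ("http://example.com/b", "unfetched"),
    ("http://example.org/", "fetched"),
]


def map_entry(url, status):
    """Map: <url, datum> -> <host, hostStats> for a single URL."""
    host = urlsplit(url).hostname
    return host, {"urls": 1, "fetched": 1 if status == "fetched" else 0}


def reduce_host(host, stats_list):
    """Reduce: <host, (hostStats1, ...)> -> <host, hostStats>, aggregated."""
    total = {"urls": 0, "fetched": 0}
    for stats in stats_list:
        for key, value in stats.items():
            total[key] += value
    return host, total


def run_job(entries):
    """Simulate the shuffle: group map output by host, then reduce."""
    grouped = defaultdict(list)
    for url, status in entries:
        host, stats = map_entry(url, status)
        grouped[host].append(stats)
    return dict(reduce_host(host, stats) for host, stats in grouped.items())


host_stats = run_job(CRAWLDB)
```

Since the per-host numbers are recomputed from CrawlDb on each run, this
sketch needs no separately maintained per-host file.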

PS. Could you please wrap your lines to 80 chars? I always have to
re-wrap your emails when responding, otherwise they consist of very,
very long lines.

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
