[EMAIL PROTECTED] wrote:
I do understand that CrawlDb is the source to get all known URLs
from, and from those URLs we can extract host names, domains, etc.
(what DomainStatistics tool does), but I don't understand how you'd
use CrawlDb as the source of per-host data, since CrawlDb does not
have aggregate per-host data. Shouldn't that live in a separate
file, a file that can be updated after every fetch run?
Well, as it happens, map-reduce is exceptionally good at collecting
aggregate data :) This is a simple map-reduce job, where we do the
following:
Map:
  input:  <url, CrawlDatum> from CrawlDb
  output: <host, hostStats>
  (host is extracted from the current url, and hostStats is
  extracted from the data in this CrawlDatum)

Reduce:
  input:  <host, (hostStats1, hostStats2, ...)>
  output: <host, hostStats>  // aggregated
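To illustrate, here is a minimal sketch of that job in plain Java, simulating the map and reduce phases in-process rather than on Hadoop. The "hostStats" here is a hypothetical stand-in (just a per-host URL count); in a real Nutch job you would extract richer statistics from each CrawlDatum and implement this as a Mapper/Reducer pair.

```java
import java.net.URI;
import java.util.*;

// Simulated version of the map-reduce job described above.
// Map:    <url, CrawlDatum> -> <host, hostStats>  (here: <host, 1>)
// Reduce: <host, (stats1, stats2, ...)> -> <host, aggregated stats>
public class HostAggregation {

    // Map phase: extract the host from the url and emit a unit count.
    static Map.Entry<String, Integer> map(String url) throws Exception {
        String host = new URI(url).getHost();
        return Map.entry(host, 1);
    }

    // Reduce phase: sum the per-host values emitted by the map phase.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairs)
            out.merge(e.getKey(), e.getValue(), Integer::sum);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for iterating <url, CrawlDatum> records from CrawlDb.
        List<String> crawlDb = List.of(
            "http://example.com/a",
            "http://example.com/b",
            "http://other.org/x");
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String url : crawlDb)
            mapped.add(map(url));
        Map<String, Integer> stats = reduce(mapped);
        System.out.println(stats); // per-host aggregate, e.g. url counts
    }
}
```

The same shape generalizes: replace the Integer value with a record of whatever per-host fields you need, and make the reducer merge them field by field.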
PS. Could you please wrap your lines to 80 chars? I always have to
re-wrap your emails when responding, otherwise they consist of very,
very long lines ..
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com