[EMAIL PROTECTED] wrote:

I do understand that CrawlDb is the source to get all known URLs
from, and from those URLs we can extract host names, domains, etc.
(what DomainStatistics tool does), but I don't understand how you'd
use CrawlDb as the source of per-host data, since CrawlDb does not
have aggregate per-host data.  Shouldn't that live in a separate
file, a file that can be updated after every fetch run?

Well, as it happens, map-reduce is exceptionally good at collecting aggregate data :) This is a simple map-reduce job, where we do the following:

Map:    input: <url, CrawlDatum> from CrawlDb
        output: <host, hostStats>

Host is extracted from the current URL, and hostStats is extracted from the data in this CrawlDatum.

Reduce: input: <host, (hostStats1, hostStats2, ...)>
        output: <host, hostStats> // aggregated
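
To make the map and reduce steps concrete, here is a minimal
self-contained sketch in plain Java (not the actual Hadoop/Nutch API).
The HostStats class, its field names, and the boolean "fetched" flag
standing in for a CrawlDatum are all illustrative assumptions; the
point is only the shape of the computation: map emits <host, partial
stats> per URL, and reduce merges partials per host.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-host aggregate; field names are illustrative only.
class HostStats {
    long urlCount;
    long fetchedCount;
    HostStats(long urls, long fetched) { urlCount = urls; fetchedCount = fetched; }
    // Reduce step: merge two partial aggregates for the same host.
    static HostStats merge(HostStats a, HostStats b) {
        return new HostStats(a.urlCount + b.urlCount,
                             a.fetchedCount + b.fetchedCount);
    }
}

public class HostAggregation {
    // Map step: one CrawlDb entry in, one <host, partial stats> pair out.
    // A boolean "fetched" flag stands in for the real CrawlDatum status.
    static Map.Entry<String, HostStats> map(String url, boolean fetched)
            throws Exception {
        String host = new java.net.URL(url).getHost();
        return Map.entry(host, new HostStats(1, fetched ? 1 : 0));
    }

    public static void main(String[] args) throws Exception {
        // Toy CrawlDb: <url, fetched?> rows standing in for <url, CrawlDatum>.
        Object[][] crawlDb = {
            {"http://example.com/a", true},
            {"http://example.com/b", false},
            {"http://other.org/x",   true},
        };
        // Shuffle + reduce: group map output by host, merging partials.
        Map<String, HostStats> result = new TreeMap<>();
        for (Object[] row : crawlDb) {
            Map.Entry<String, HostStats> e = map((String) row[0], (Boolean) row[1]);
            result.merge(e.getKey(), e.getValue(), HostStats::merge);
        }
        for (Map.Entry<String, HostStats> e : result.entrySet())
            System.out.println(e.getKey()
                + " urls=" + e.getValue().urlCount
                + " fetched=" + e.getValue().fetchedCount);
    }
}
```

In a real Hadoop job the shuffle phase does the grouping by host for
you; the loop with Map.merge above is just simulating that in-process.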



PS. Could you please wrap your lines to 80 chars? I always have to
re-wrap your emails when responding, otherwise they consist of very
very long lines...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
