[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1325:
---------------------------------
    Component/s: hostdb

> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>          Components: hostdb
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch, NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch, NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.trunk.v2.path, oi-hostdb.patch, oi-hostdb.patch, oi-hostdb.patch
>
>
> h1. HostDB for Apache Nutch 1.x
> * automatically generates a HostDB based on CrawlDB information
> * periodically performs DNS lookups for all hosts and keeps track of DNS failures
> * discovers the homepage if www.example.org/ is a redirect
> * keeps track of host statistics such as the number of URLs, 404s, not-modified responses and redirects
> * aggregates CrawlDB metadata fields into totals, sums, min, max, average and configurable percentiles
> * can output lists of discovered homepage URLs for seed lists and static fetch interval
> * can output blacklists of hosts that have too many DNS failures, to filter them from the CrawlDB using domainblacklist-urlfilter
> * supports JEXL expressions, just like CrawlDB
> h4. Examples
> Generate for the first time, or update an existing HostDB:
> {code}
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
> {code}
> Optional filtering or normalizing:
> {code}
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize
> {code}
> Dumping as CSV file:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory
> {code}
> Get only hostnames that have an average response time above 50 ms:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"
> {code}
> Get only hosts that have over 50% 404s:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
> {code}
> For JEXL expressions, all host metadata fields are available. The following fields are also available:
> unfetched -- number of unfetched records
> fetched -- number of fetched records
> gone -- number of 404s
> redirTemp -- number of temporary redirects
> redirPerm -- number of permanent redirects
> redirs -- total number of redirects (redirTemp + redirPerm)
> notModified -- number of not-modified records
> ok -- number of usable pages (fetched + notModified)
> numRecords -- total number of records
> dnsFailures -- number of DNS failures
> Also, see nutch-default for hostdb.* properties.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
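
As a rough illustration of how the per-host fields listed in the description relate to each other, the sketch below recomputes the two derived counters (redirs and ok) and mirrors the "(gone / numRecords > 0.5)" filter in plain Python. It is not Nutch code and does not evaluate JEXL, so it makes no claim about JEXL's own arithmetic rules. The function names, the zero-record guard and the sample counts are assumptions made up for the example.

```python
# Illustrative sketch only: plain Python, not Nutch code and not a JEXL evaluator.
# Field names follow the list in the issue description; everything else is hypothetical.


def derived_fields(stats):
    """Return a copy of stats with the two derived counters added."""
    out = dict(stats)
    # redirs -- total number of redirects (redirTemp + redirPerm)
    out["redirs"] = stats["redirTemp"] + stats["redirPerm"]
    # ok -- number of usable pages (fetched + notModified)
    out["ok"] = stats["fetched"] + stats["notModified"]
    return out


def mostly_gone(stats, threshold=0.5):
    """Python analogue of the expression (gone / numRecords > 0.5)."""
    if stats["numRecords"] == 0:
        # Assumption: a host without records is never reported as mostly gone.
        return False
    return stats["gone"] / stats["numRecords"] > threshold


# Hypothetical host with made-up counts.
example_host = {
    "unfetched": 10,
    "fetched": 20,
    "gone": 60,
    "redirTemp": 3,
    "redirPerm": 2,
    "notModified": 5,
    "numRecords": 100,
    "dnsFailures": 0,
}

if __name__ == "__main__":
    enriched = derived_fields(example_host)
    print(enriched["redirs"], enriched["ok"], mostly_gone(enriched))
```

With these sample counts, 60 of 100 records are gone, so the host would match the "over 50% 404s" filter.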