[ https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508445 ]
Doğacan Güney commented on NUTCH-289:
-------------------------------------

It seems this issue has kind of died down, but this would be a great feature to have. Here is how I think we can do this one (my proposal is _heavily_ based on Stefan Groschupf's work):

* Add ip as a field to CrawlDatum.
* Fetcher always resolves the ip and stores it in crawl_fetch (even if the CrawlDatum already has an ip).
* A similar IpAddressResolver tool that reads crawl_fetch, crawl_parse (and probably crawldb) and (optionally) runs before updatedb.
  - map: <url, CrawlDatum> -> <host of url, <url, CrawlDatum>>. Add a field to CrawlDatum's metadata indicating where it comes from (crawldb, crawl_fetch or crawl_parse); this field is removed in reduce. No lookup is performed in map().
  - reduce: <host, list(<url, CrawlDatum>)> -> <url, CrawlDatum>. If any CrawlDatum already contains an ip address (ip addresses in crawl_fetch taking precedence over those in crawldb), then output all crawl_parse datums with that ip address. Otherwise, perform a lookup. This way, we will not have to resolve the ip for most urls (in a way, we will still be getting the benefits of the JVM cache :). A downside of this approach is that we will either have to read crawldb twice or perform ip lookups for hosts that are in crawldb but not in crawl_fetch.
* Use the cached ip during generation, if it exists.

> CrawlDatum should store IP address
> ----------------------------------
>
>                 Key: NUTCH-289
>                 URL: https://issues.apache.org/jira/browse/NUTCH-289
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Doug Cutting
>         Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, ipInCrawlDatumDraftV5.patch
>
>
> If the CrawlDatum stored the IP address of the host of its URL, then one could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.
> This would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
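The reduce step proposed above could be sketched roughly as follows. This is a minimal, self-contained illustration only: the `Datum` class and `lookup()` below are simplified stand-ins for Nutch's actual CrawlDatum and a real DNS resolver, and are not part of the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed IpAddressResolver reduce step.
// All names here are illustrative, not Nutch's real API.
public class IpResolveSketch {

    public static class Datum {
        public final String url;
        public String ip;           // null if not yet resolved
        public final String source; // "crawldb", "crawl_fetch" or "crawl_parse"

        public Datum(String url, String ip, String source) {
            this.url = url;
            this.ip = ip;
            this.source = source;
        }
    }

    // reduce: <host, list(<url, Datum>)> -> crawl_parse datums with ip filled in.
    // If any datum for this host already carries an ip (crawl_fetch taking
    // precedence over crawldb), reuse it; only otherwise fall back to a lookup.
    public static List<Datum> reduce(String host, List<Datum> values) {
        String ip = null;
        for (Datum d : values) {
            if (d.ip != null) {
                if ("crawl_fetch".equals(d.source)) { ip = d.ip; break; }
                if (ip == null) ip = d.ip; // crawldb ip, unless crawl_fetch overrides
            }
        }
        if (ip == null) ip = lookup(host); // only unresolved hosts hit DNS

        List<Datum> out = new ArrayList<>();
        for (Datum d : values) {
            if ("crawl_parse".equals(d.source)) {
                d.ip = ip;
                out.add(d);
            }
        }
        return out;
    }

    // Stand-in for a real DNS lookup (e.g. InetAddress.getByName(host)).
    static String lookup(String host) {
        return "0.0.0.0"; // placeholder
    }
}
```

Since most hosts seen in crawl_parse were just fetched, their crawl_fetch datum usually supplies the ip, so the actual lookup() branch is rarely taken.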