[ https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508445 ]

Doğacan Güney commented on NUTCH-289:
-------------------------------------

It seems activity on this issue has died down, but this would be a great 
feature to have. 

Here is how I think we can do this one (my proposal is _heavily_ based on 
Stefan Groschupf's work):

* Add the IP address as a field to CrawlDatum.

* Fetcher always resolves the IP and stores it in crawl_fetch (even if the 
CrawlDatum already has one).

* An IpAddressResolver tool, along the lines of Stefan's patches, that 
reads crawl_fetch, crawl_parse (and probably crawldb) and optionally runs 
before updatedb. 
  - map: <url, CrawlDatum> -> <host of url, <url, CrawlDatum>>. Add a 
field to the CrawlDatum's metadata indicating where it is coming from 
(crawldb, crawl_fetch or crawl_parse); this field is removed again in 
reduce. No lookup is performed in map().

  - reduce: <host, list(<url, CrawlDatum>)> -> <url, CrawlDatum>. If any 
CrawlDatum already contains an IP address (addresses in crawl_fetch taking 
precedence over ones in crawldb), then output all crawl_parse datums with 
that IP address. Otherwise, perform a lookup. This way we will not have to 
resolve IPs for most URLs (in a way, we still get the benefits of the JVM 
cache :). A rough sketch follows the list below.

A downside of this approach is that we will either have to read crawldb 
twice, or perform IP lookups for the hosts that appear in crawldb but not 
in crawl_fetch.

* Use the cached IP during generation, if it exists (see the partitioner 
sketch below).
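
To make the map/reduce part concrete, here is a rough sketch of what the 
tool could look like against the mapred API. To be clear about what is 
made up: the getIpAddress()/setIpAddress() accessors stand in for the 
proposed CrawlDatum field, the metadata keys and the path-based source 
detection are invented for illustration (and assume the metadata map 
supports put/get/remove), and the job setup that adds crawldb, crawl_fetch 
and crawl_parse as inputs is omitted.

{code}
import java.io.IOException;
import java.net.InetAddress;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

import org.apache.nutch.crawl.CrawlDatum;

public class IpAddressResolver {

  // Temporary metadata keys, removed again in reduce(). The keys, the
  // source detection below, and CrawlDatum.getIpAddress()/setIpAddress()
  // are assumptions of this sketch, not existing Nutch API.
  private static final Text SRC_KEY = new Text("_ipres.src_");
  private static final Text URL_KEY = new Text("_ipres.url_");

  public static class ResolverMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

    private Text source;

    public void configure(JobConf job) {
      // Guess which input this split belongs to from its path.
      String file = job.get("map.input.file", "");
      if (file.indexOf("crawl_fetch") >= 0) {
        source = new Text("crawl_fetch");
      } else if (file.indexOf("crawl_parse") >= 0) {
        source = new Text("crawl_parse");
      } else {
        source = new Text("crawldb");
      }
    }

    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
      // Tag the datum with its source and original url; no DNS lookup
      // happens in map().
      datum.getMetaData().put(SRC_KEY, source);
      datum.getMetaData().put(URL_KEY, new Text(url));
      output.collect(new Text(new URL(url.toString()).getHost()), datum);
    }
  }

  public static class ResolverReducer extends MapReduceBase
      implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

    public void reduce(Text host, Iterator<CrawlDatum> values,
        OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
      List<CrawlDatum> parseDatums = new ArrayList<CrawlDatum>();
      byte[] ip = null;
      while (values.hasNext()) {
        CrawlDatum d = values.next();
        String src = d.getMetaData().get(SRC_KEY).toString();
        if ("crawl_parse".equals(src)) {
          CrawlDatum copy = new CrawlDatum();
          copy.set(d);                       // the iterator reuses objects
          parseDatums.add(copy);
        } else if (d.getIpAddress() != null
            && (ip == null || "crawl_fetch".equals(src))) {
          // crawl_fetch addresses take precedence over crawldb ones
          ip = d.getIpAddress().clone();
        }
      }
      if (ip == null) {
        try {                                // no cached address: resolve
          ip = InetAddress.getByName(host.toString()).getAddress();
        } catch (UnknownHostException e) {
          // leave ip null; the datums go out unresolved
        }
      }
      for (CrawlDatum d : parseDatums) {
        Text url = (Text) d.getMetaData().get(URL_KEY);
        d.getMetaData().remove(SRC_KEY);
        d.getMetaData().remove(URL_KEY);
        if (ip != null) d.setIpAddress(ip);
        output.collect(url, d);
      }
    }
  }
}
{code}

Note that this buffers all crawl_parse datums of a host in reduce(), so 
memory could become a concern for hosts with very many new outlinks.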

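On the generation side, a partitioner keyed on the cached IP could then be 
a near drop-in replacement for PartitionUrlByHost. Again a sketch only: 
getIpAddress() is the proposed field, and the <Text, CrawlDatum> signature 
is assumed for simplicity.

{code}
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

import org.apache.nutch.crawl.CrawlDatum;

// Sketch only: partitions fetch list entries by the IP cached in the
// (proposed) CrawlDatum field, so politeness limits apply per IP.
public class PartitionUrlByIp implements Partitioner<Text, CrawlDatum> {

  public void configure(JobConf job) {}

  public int getPartition(Text url, CrawlDatum datum, int numReduceTasks) {
    byte[] ip = datum.getIpAddress();        // proposed field, may be unset
    int hash;
    if (ip != null) {
      hash = Arrays.hashCode(ip);
    } else {
      try {
        // no cached IP yet: fall back to the host, as PartitionUrlByHost does
        hash = new URL(url.toString()).getHost().hashCode();
      } catch (MalformedURLException e) {
        hash = url.hashCode();
      }
    }
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
  }
}
{code}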

> CrawlDatum should store IP address
> ----------------------------------
>
>                 Key: NUTCH-289
>                 URL: https://issues.apache.org/jira/browse/NUTCH-289
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Doug Cutting
>         Attachments: ipInCrawlDatumDraftV1.patch, 
> ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, 
> ipInCrawlDatumDraftV5.patch
>
>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
