[
https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508445
]
Doğacan Güney commented on NUTCH-289:
-------------------------------------
It seems this issue has kind of died down, but this would be a great feature to
have.
Here is how I think we can do this one (my proposal is _heavily_ based on
Stefan Groschupf's work):
* Add an ip field to CrawlDatum.
* The Fetcher always resolves the IP and stores it in crawl_fetch (even if the
CrawlDatum already has one).
* An IpAddressResolver tool (similar to Stefan's) that reads crawl_fetch,
crawl_parse (and probably crawldb) and optionally runs before updatedb:
  - map: <url, CrawlDatum> -> <host of url, <url, CrawlDatum>>. Add a field
to CrawlDatum's metadata to indicate where it comes from (crawldb, crawl_fetch,
or crawl_parse); this field is removed in reduce. No lookup is performed in
map().
  - reduce: <host, list(<url, CrawlDatum>)> -> <url, CrawlDatum>. If any
CrawlDatum already contains an IP address (IP addresses in crawl_fetch taking
precedence over ones in crawldb), then output all crawl_parse datums with that
IP address. Otherwise, perform a lookup. This way, we will not have to resolve
the IP for most URLs (in a way, we will still be getting the benefits of the
JVM cache :).
A downside of this approach is that we will either have to read crawldb twice
or perform IP lookups for hosts that appear in crawldb but not in crawl_fetch.
* Use the cached IP during generation, if it exists.
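To make the reduce step concrete, here is a minimal sketch of the per-host IP
selection it describes. All names (IpSelector, Entry, Source) are hypothetical
and not from any attached patch; Entry stands in for the <url, CrawlDatum>
values grouped under one host, and the source field mirrors the metadata marker
map() would add:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

public class IpSelector {

    // Where a datum came from, as tagged by the (hypothetical) map() step.
    public enum Source { CRAWLDB, CRAWL_FETCH, CRAWL_PARSE }

    public static class Entry {
        final Source source;
        final String ip;  // null if this datum carries no IP yet

        public Entry(Source source, String ip) {
            this.source = source;
            this.ip = ip;
        }
    }

    /**
     * Pick the IP to stamp on all crawl_parse datums for this host:
     * an IP from crawl_fetch wins over one from crawldb, and a DNS
     * lookup happens only when no datum carries an IP at all.
     */
    public static String selectIp(String host, List<Entry> entries)
            throws UnknownHostException {
        String fromCrawlDb = null;
        for (Entry e : entries) {
            if (e.ip == null) continue;
            if (e.source == Source.CRAWL_FETCH) return e.ip;  // highest precedence
            if (e.source == Source.CRAWLDB) fromCrawlDb = e.ip;
        }
        if (fromCrawlDb != null) return fromCrawlDb;
        // No datum has an IP: resolve once per host, not once per URL.
        return InetAddress.getByName(host).getHostAddress();
    }
}
```

Since the values for a host arrive together in reduce(), the lookup branch
runs at most once per host, which is where the savings over per-URL resolution
come from.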
> CrawlDatum should store IP address
> ----------------------------------
>
> Key: NUTCH-289
> URL: https://issues.apache.org/jira/browse/NUTCH-289
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8
> Reporter: Doug Cutting
> Attachments: ipInCrawlDatumDraftV1.patch,
> ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch,
> ipInCrawlDatumDraftV5.patch
>
>
> If the CrawlDatum stored the IP address of the host of its URL, then one
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname. This
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a
> new outlink, or perhaps during CrawlDB update.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers