[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ]
Doug Cutting commented on NUTCH-289:
------------------------------------

It should be possible to partition by IP and limit fetchlists by IP. Resolving only in the fetcher is too late to implement these features.

Ideally we should arrange things for good DNS cache utilization, so that URLs with the same host are resolved in a single map or reduce task. Currently this is the case during fetchlist generation, where lists are partitioned by host. Might that be a good place to insert DNS resolution? The fetchlists would need to be processed one more time, to re-partition and re-limit by IP, but fetchlists are relatively small, so this might not slow things too much. The map task itself could directly cache IP addresses, and perhaps even avoid many DNS lookups by using the IP from another CrawlDatum from the same host. A multi-threaded mapper might also be used to allow for network latencies.

This should, at least initially, be an optional feature, and thus the IP should probably initially be stored in the metadata. I think it might be added as a re-generate step without changing any other code.

> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting
>
> If the CrawlDatum stored the IP address of the host of its URL, then one
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname. This
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a
> new outlink, or perhaps during CrawlDB update.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
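
Editor's sketch of the re-generate step proposed in the comment: resolve each host once per task, then re-limit and re-partition the fetchlist by IP. This is only an illustration under stated assumptions, not Nutch code. The class names HostIpCache and IpFetchlist, the pluggable resolver function, and the choice to drop unresolvable hosts are all hypothetical; the only real API used for DNS is java.net.InetAddress from the JDK.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not part of Nutch): caches host -> IP lookups within one task. */
class HostIpCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> resolver;
    private int lookups = 0;

    /** The resolver maps a host name to an IP string, or null if it cannot be resolved. */
    HostIpCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** A resolver backed by the JDK's InetAddress; pass as HostIpCache::dnsLookup. */
    static String dnsLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    /** Returns the IP for the URL's host, calling the resolver at most once per host. */
    String resolve(String url) {
        String host = URI.create(url).getHost(); // throws IllegalArgumentException on bad syntax
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (!cache.containsKey(host)) {          // failed lookups (null) are cached too
            lookups++;
            cache.put(host, resolver.apply(host));
        }
        return cache.get(host);
    }

    int lookups() {
        return lookups;
    }
}

/** Hypothetical helper (not part of Nutch): re-limits and re-partitions URLs by IP. */
class IpFetchlist {
    /** Groups URLs by IP, keeping at most maxPerIp per IP. Unresolved URLs are dropped here. */
    static Map<String, List<String>> limitByIp(List<String> urls, HostIpCache cache, int maxPerIp) {
        Map<String, List<String>> byIp = new LinkedHashMap<>();
        for (String url : urls) {
            String ip = cache.resolve(url);
            if (ip == null) {
                continue;
            }
            List<String> bucket = byIp.computeIfAbsent(ip, k -> new ArrayList<>());
            if (bucket.size() < maxPerIp) {
                bucket.add(url);
            }
        }
        return byIp;
    }

    /** Assigns an IP to one of numPartitions fetchlists, so one IP always lands in one list. */
    static int partition(String ip, int numPartitions) {
        return (ip.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With real DNS one would construct the cache as new HostIpCache(HostIpCache::dnsLookup). Because the input to this step is already partitioned by host, each host should be looked up in only one task, which is the cache-utilization property the comment asks for. Hosts that share an IP then count against a single per-IP limit and fall into the same partition.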
