[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414273 ]
Stefan Groschupf commented on NUTCH-289:
----------------------------------------

Andrzej,
I'm afraid I was not able to communicate my ideas clearly, and we may be misunderstanding each other.

Resolving the IP in ParseOutputFormat would only be necessary for the new links discovered in the content. Since by default we parse during fetching, we would have the chance to use the JVM DNS cache, because I guess many new URLs point to the same host we fetched a particular page from. This means that if we do not parse in a separate step, we would get the best use of the JVM cache.

We do not look up the IPs of URLs we fetch at this time, since these URLs already have an IP that was resolved when they were first discovered in a parse process.

The only problem we need to handle is what happens when the IP of a host changes. We can simply look up the IP of a URL that throws a protocol error and compare the cached IP with the newly looked-up one.

An alternative approach would be to look up IPs during crawldb update, just for the new URLs.

Sorry, I hope that describes my ideas more clearly.
My personal point of view is to store the IP in the CrawlDatum, not in the metadata.

> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting
>
> If the CrawlDatum stored the IP address of the host of its URL, then one
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname. This
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a
> new outlink, or perhaps during CrawlDB update.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
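
The IP-change check described in the comment above (re-resolve a host only after a protocol error, then compare with the stored IP) could look roughly like the sketch below. This is only an illustration under assumptions: the class IpChangeCheck and its methods are hypothetical and not part of Nutch, and the stored IP is passed in as a plain string rather than read from a CrawlDatum. The lookup goes through java.net.InetAddress, so repeated lookups for the same host benefit from the JVM's own address cache.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper for illustration only; not part of Nutch. */
public class IpChangeCheck {

    /**
     * Resolves a host name through the JVM resolver. InetAddress caches
     * successful lookups, so many outlinks to the same host cost one query.
     */
    public static Optional<String> resolve(String host) {
        try {
            return Optional.of(InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty(); // host could not be resolved at all
        }
    }

    /**
     * Intended to be called after a protocol error. Returns the freshly
     * resolved IP if it differs from the stored one, otherwise empty
     * (also empty when the host no longer resolves).
     */
    public static Optional<String> changedIp(
            String host,
            String storedIp,
            Function<String, Optional<String>> resolver) {
        return resolver.apply(host).filter(fresh -> !fresh.equals(storedIp));
    }

    public static void main(String[] args) {
        // An IP literal is used so that no real DNS query is needed here.
        System.out.println(changedIp("127.0.0.1", "10.0.0.1", IpChangeCheck::resolve));
        System.out.println(changedIp("127.0.0.1", "127.0.0.1", IpChangeCheck::resolve));
    }
}
```

Passing the resolver as a parameter keeps the comparison logic testable without network access; a caller would normally pass IpChangeCheck::resolve and, when a changed IP comes back, update the stored value before retrying the fetch.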