[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413940 ]
Stefan Groschupf commented on NUTCH-289:
----------------------------------------

+1

Andrzej, I agree that looking up the IP in ParseOutputFormat would be best, as Doug suggested.

The biggest problem Nutch has at the moment is spam. The most common spam method is to set up a DNS server that returns the same IP for all subdomains and then deliver dynamically generated content. The spammers then just randomly generate subdomains within that content. It also often happens that they have many URLs, all of them pointing to the same server (== IP). Buying more IP addresses is possible, but at the moment more expensive than buying more domains.

Limiting URLs by IP is a great approach to keep the crawler from getting stuck in honeypots with tens of thousands of URLs pointing to the same IP. However, to do so we need to have the IP available at generation time, rather than looking it up when fetching. We would be able to reuse the IP in the fetcher; we could also wrap the relevant parts of the fetcher in a try/catch and, in case the IP is not available, look it up again.

I don't think round-robin DNS is a huge problem, since only large sites use it, and in such a case each IP is able to handle requests.

In any case, storing the IP in the CrawlDatum and using it for URLs-per-IP limits would be a big step forward in the fight against web spam.

> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting
>
> If the CrawlDatum stored the IP address of the host of its URL, then one
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname. This
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a
> new outlink, or perhaps during CrawlDB update.
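
To make the per-IP limit discussed above concrete, here is a minimal sketch in plain Java. It is not Nutch code: the Candidate class is a hypothetical stand-in for a CrawlDatum that already carries a stored IP, and limitByIp and maxPerIp are names made up for illustration. It only shows the counting idea, assuming the IP is known at generation time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IpLimitSketch {

    /** Hypothetical stand-in for a CrawlDatum that carries a stored IP. */
    static final class Candidate {
        final String url;
        final String ip; // null if no IP has been stored yet

        Candidate(String url, String ip) {
            this.url = url;
            this.ip = ip;
        }
    }

    /**
     * Keeps at most maxPerIp URLs for each IP, in input order.
     * Candidates without a stored IP are passed through unchanged here
     * (an assumption of this sketch); they would need a lookup later.
     */
    static List<String> limitByIp(List<Candidate> candidates, int maxPerIp) {
        Map<String, Integer> countPerIp = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.ip == null) {
                selected.add(c.url);
                continue;
            }
            int count = countPerIp.getOrDefault(c.ip, 0);
            if (count < maxPerIp) {
                countPerIp.put(c.ip, count + 1);
                selected.add(c.url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Three generated subdomains resolve to one IP; one unrelated host.
        List<Candidate> candidates = Arrays.asList(
            new Candidate("http://a.spam.example/", "192.0.2.1"),
            new Candidate("http://b.spam.example/", "192.0.2.1"),
            new Candidate("http://c.spam.example/", "192.0.2.1"),
            new Candidate("http://www.example.org/", "198.51.100.7"));
        System.out.println(limitByIp(candidates, 2));
    }
}
```

With a limit of 2, the third subdomain on 192.0.2.1 is dropped while the unrelated host is kept, which is the effect a hostname-based limit cannot achieve when every URL uses a different subdomain.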
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
