[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63395 ]

byron miller commented on NUTCH-13:
-----------------------------------

Would it make sense to ignore all IP-based URLs? In my experience, IP URLs typically point to short-lived hosts, mirror servers, load-balanced sites, proxy hosts, or misconfigured sites. Perhaps an option such as "ignore_ip_based_urls".

> If dns points to 127.0.0.1, the url is also crawled
> ---------------------------------------------------
>
>          Key: NUTCH-13
>          URL: http://issues.apache.org/jira/browse/NUTCH-13
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Reporter: Matthias Jaekle
>     Priority: Minor
>
> For example www.tik24.de points to 127.0.0.1.
> If you follow a link to www.tik24.de fetcher will crawl content from your own
> machine.
> Wrong DNS entries could create unwanted entries in segments.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
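
To make the two ideas in this thread concrete (skipping IP-based URLs, and not crawling hosts whose DNS entry points back at the local machine), here is a minimal sketch in Java. The class `IpUrlFilter`, its `ignoreIpBasedUrls` flag, and the `accept` method are hypothetical names chosen for illustration; they are not part of Nutch's actual fetcher or URL filter API, and only IPv4 dotted-quad and bracketed IPv6 literals are recognized.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical filter; not an existing Nutch class.
final class IpUrlFilter {
    // Mirrors the proposed "ignore_ip_based_urls" option (assumed name).
    private final boolean ignoreIpBasedUrls;

    IpUrlFilter(boolean ignoreIpBasedUrls) {
        this.ignoreIpBasedUrls = ignoreIpBasedUrls;
    }

    // True if the host is an IPv4 dotted quad or a bracketed IPv6 literal.
    static boolean isIpLiteral(String host) {
        if (host == null || host.isEmpty()) {
            return false;
        }
        if (host.startsWith("[") && host.endsWith("]")) {
            return true; // URI.getHost() keeps the brackets on IPv6 literals
        }
        String[] parts = host.split("\\.", -1);
        if (parts.length != 4) {
            return false;
        }
        for (String part : parts) {
            if (part.isEmpty() || part.length() > 3) {
                return false;
            }
            for (char c : part.toCharArray()) {
                if (c < '0' || c > '9') {
                    return false;
                }
            }
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }

    // True for loopback (e.g. 127.0.0.1, ::1) and wildcard (0.0.0.0) addresses.
    static boolean isLocalAddress(InetAddress addr) {
        return addr.isLoopbackAddress() || addr.isAnyLocalAddress();
    }

    // Returns false for URLs that should not be fetched.
    boolean accept(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparsable URL
        }
        if (host == null) {
            return false;
        }
        if (ignoreIpBasedUrls && isIpLiteral(host)) {
            return false;
        }
        try {
            // For host names this performs a DNS lookup; for IP literals
            // it only parses the address.
            return !isLocalAddress(InetAddress.getByName(host));
        } catch (UnknownHostException e) {
            return false; // host does not resolve
        }
    }
}
```

With the flag off, `new IpUrlFilter(false).accept("http://127.0.0.1/")` is still rejected by the loopback check, which is the case the bug report describes. A host name such as www.tik24.de would be resolved first, so the result depends on its DNS entry at the time of the call. A real fetcher would probably reuse the address it already resolved instead of doing a second lookup.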
