[ https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453899#comment-16453899 ]
Hudson commented on NUTCH-2527:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1607 (See [https://builds.apache.org/job/Nutch-nutchgora/1607/])
NUTCH-2527 URL filter: provide rules to exclude localhost and private (snagel: [https://github.com/apache/nutch/commit/d62ece00469fd6b2012418062602246f090e10c5])
* (edit) conf/regex-urlfilter.txt.template

> URL filter: provide rules to exclude localhost and private address spaces
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2527
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2527
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I've found hundreds of
> (luckily failed) requests of local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread]
> org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher:
> fetch of http://127.0.0.42/ failed with: java.net.ConnectException:
> Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loop-back addresses, private address spaces
> should be blocked for a wider web crawl where links are not controlled to
> avoid that information is leaked by links or redirects pointing to web
> interfaces of services running on the crawling machines (e.g., HDFS, Hadoop
> YARN).
> Of course, this must be optional. For testing it's quite common to crawl your
> local machine.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
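
For illustration only, the kind of exclusion the issue describes could be sketched with a few regular expressions. This is a minimal Python sketch, not the rules added to conf/regex-urlfilter.txt.template by the commit: the patterns, the HOST_PATTERNS list, and the is_excluded helper are all assumptions made up for this example. It only looks at the literal host in the URL (no DNS resolution) and covers localhost, the IPv4 loopback block, the RFC 1918 private ranges, and the IPv6 loopback address.

```python
import re

# Hypothetical patterns for illustration; NOT copied from Nutch's
# conf/regex-urlfilter.txt.template.
_OCTET = r"\d{1,3}"
# After the host: optional port, then a path/query/fragment delimiter or end.
_END = r"(?::\d+)?(?:[/?#]|$)"

HOST_PATTERNS = [
    r"localhost",                                        # localhost by name
    rf"127\.{_OCTET}\.{_OCTET}\.{_OCTET}",               # loopback 127.0.0.0/8
    rf"10\.{_OCTET}\.{_OCTET}\.{_OCTET}",                # private 10.0.0.0/8
    rf"172\.(?:1[6-9]|2\d|3[01])\.{_OCTET}\.{_OCTET}",   # private 172.16.0.0/12
    rf"192\.168\.{_OCTET}\.{_OCTET}",                    # private 192.168.0.0/16
    r"\[::1\]",                                          # IPv6 loopback
]

EXCLUDE = re.compile(
    r"^https?://(?:" + "|".join(HOST_PATTERNS) + r")" + _END,
    re.IGNORECASE,
)


def is_excluded(url: str) -> bool:
    """Return True if the URL's literal host is local or private."""
    return EXCLUDE.match(url) is not None


if __name__ == "__main__":
    for url in ("http://127.0.0.42/", "https://nutch.apache.org/"):
        print(url, is_excluded(url))
```

A check like this is purely textual: a public hostname that resolves or redirects to a private address, or a URL with a user-info part before the host, would pass it. Keeping the patterns in one list also makes it easy to disable them when crawling a local machine for testing, which the issue asks to keep possible.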