[ 
https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453899#comment-16453899
 ] 

Hudson commented on NUTCH-2527:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1607 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1607/])
NUTCH-2527 URL filter: provide rules to exclude localhost and private (snagel: 
[https://github.com/apache/nutch/commit/d62ece00469fd6b2012418062602246f090e10c5])
* (edit) conf/regex-urlfilter.txt.template


> URL filter: provide rules to exclude localhost and private address spaces
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2527
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2527
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: fetch of http://127.0.0.42/ failed with: java.net.ConnectException: Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loop-back addresses, or private address spaces 
> should be blocked in a wider web crawl where links are not controlled. 
> Otherwise, information could be leaked through links or redirects that 
> point to web interfaces of services running on the crawling machines 
> (e.g., HDFS, Hadoop YARN).
> Of course, the blocking must be optional: for testing, it is quite common 
> to crawl your local machine.
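
The check described in the issue can be sketched roughly as follows. This is a minimal Python illustration and not the rules committed to conf/regex-urlfilter.txt.template, which are regular expressions. The names is_local_or_private, filter_urls, and block_local are hypothetical and chosen only for this example. The sketch inspects the host as written in the URL and does not resolve host names, so a public name that resolves to a private address is not caught.

```python
import ipaddress
from urllib.parse import urlsplit


def is_local_or_private(url: str) -> bool:
    """Return True if the URL's host is localhost or a literal IP address
    in a loop-back, private, or link-local range (hypothetical helper)."""
    host = urlsplit(url).hostname  # lower-cased; IPv6 brackets removed
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A regular host name: this sketch does no DNS lookup.
        return False
    return addr.is_loopback or addr.is_private or addr.is_link_local


def filter_urls(urls, block_local=True):
    """Drop local/private URLs unless blocking is switched off.

    block_local is an assumed switch that mirrors the requirement that
    the exclusion be optional (e.g. when crawling a local test server).
    """
    if not block_local:
        return list(urls)
    return [u for u in urls if not is_local_or_private(u)]
```

Calling filter_urls with block_local=False keeps every URL, which corresponds to the testing scenario where the local machine is crawled on purpose.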



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
