Thamme Gowda N created NUTCH-2144:
-------------------------------------

             Summary: Plugin to override db.ignore.external to exempt 
interesting external domain URLs
                 Key: NUTCH-2144
                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
             Project: Nutch
          Issue Type: New Feature
          Components: crawldb, fetcher
            Reporter: Thamme Gowda N
            Priority: Minor


Create a rule-based urlfilter plugin that allows a focused crawler 
(db.ignore.external.links=true) to fetch static resources from external domains.
More generally, the plugin should permit interesting URLs from external 
domains by overriding db.ignore.external.links. Whether a URL is interesting 
is decided by a combination of regex and MIME-type rules.
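
To illustrate the kind of rule this proposes, here is a minimal sketch in plain 
Java. It is not the Nutch plugin API: the class name ExternalLinkExemption, the 
Rule type, and the isExempted method are hypothetical, and the sketch assumes 
the MIME type of a link is already known or guessed by the caller. A URL is 
exempted when at least one rule matches both its URL regex and its MIME-type 
regex.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex + MIME-type exemption rules; not Nutch code. */
class ExternalLinkExemption {

    /** One rule: a URL regex paired with a MIME-type regex (assumed format). */
    static final class Rule {
        final Pattern urlPattern;
        final Pattern mimePattern;

        Rule(String urlRegex, String mimeRegex) {
            this.urlPattern = Pattern.compile(urlRegex);
            this.mimePattern = Pattern.compile(mimeRegex);
        }

        /** The URL regex may match anywhere; the MIME regex must match fully. */
        boolean matches(String url, String mimeType) {
            return urlPattern.matcher(url).find()
                && mimePattern.matcher(mimeType).matches();
        }
    }

    private final List<Rule> rules;

    ExternalLinkExemption(List<Rule> rules) {
        this.rules = rules;
    }

    /** Returns true if any rule exempts this external URL from being ignored. */
    boolean isExempted(String url, String mimeType) {
        for (Rule rule : rules) {
            if (rule.matches(url, mimeType)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Example rule: image files by extension, with an image/* MIME type.
        ExternalLinkExemption exemption = new ExternalLinkExemption(Arrays.asList(
            new Rule("\\.(jpe?g|png|gif)(\\?.*)?$", "image/.*")));

        System.out.println(
            exemption.isExempted("http://cdn.example.com/a/photo.jpg", "image/jpeg"));
        System.out.println(
            exemption.isExempted("http://other.example.org/page.html", "text/html"));
    }
}
```

With rules like this, a single image rule could cover every CDN, instead of 
one allow-regex per external domain.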


Concrete use case:
  When using Nutch to crawl images from a set of domains, the crawler needs to 
fetch all images, which may be linked from CDNs and other external domains. In 
this scenario, allowing all external links and then writing hundreds of regular 
expressions to filter them is not feasible for a large number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
