[jira] [Created] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

Thamme Gowda N (JIRA) Sun, 18 Oct 2015 23:55:07 -0700

Thamme Gowda N created NUTCH-2144:
-------------------------------------

             Summary: Plugin to override db.ignore.external to exempt 
interesting external domain URLs
                 Key: NUTCH-2144
                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
             Project: Nutch
          Issue Type: New Feature
          Components: crawldb, fetcher
            Reporter: Thamme Gowda N
            Priority: Minor



Create a rule-based urlfilter plugin that allows a focused crawler 
(db.ignore.external.links=true) to fetch static resources from external domains.
More generally, the plugin should permit interesting URLs from external domains 
by exempting them from db.ignore.external.links. Whether a URL is interesting 
is decided by a combination of regex and MIME-type rules.
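
A minimal sketch of how such exemption rules might be evaluated, to make the 
idea concrete. This is not the Nutch plugin API: the class and method names 
(ExternalLinkExemption, Rule, accept) are hypothetical, and it assumes the 
caller already has a MIME type for the target URL (for example, guessed from 
the file extension), which this issue does not specify.

```java
import java.net.URI;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of rule-based exemption; not part of Nutch. */
public class ExternalLinkExemption {

    /** One exemption rule: a URL regex plus a MIME-type prefix (both illustrative). */
    public static final class Rule {
        private final Pattern urlPattern;
        private final String mimePrefix;

        public Rule(String urlRegex, String mimePrefix) {
            this.urlPattern = Pattern.compile(urlRegex);
            this.mimePrefix = mimePrefix;
        }

        /** A rule matches only when both the regex and the MIME prefix match. */
        boolean matches(String url, String mimeType) {
            return urlPattern.matcher(url).find()
                    && mimeType != null
                    && mimeType.startsWith(mimePrefix);
        }
    }

    private final List<Rule> rules;

    public ExternalLinkExemption(List<Rule> rules) {
        this.rules = rules;
    }

    /** Lower-cased host of a URL, or "" if it has none. Throws on malformed URLs. */
    private static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h.toLowerCase();
    }

    /**
     * Emulates a crawl with external links ignored: internal links are always
     * kept, external links are kept only if some rule exempts them.
     */
    public boolean accept(String fromUrl, String toUrl, String mimeType) {
        if (host(fromUrl).equals(host(toUrl))) {
            return true; // same host: not an external link
        }
        for (Rule rule : rules) {
            if (rule.matches(toUrl, mimeType)) {
                return true; // external, but exempted by a rule
            }
        }
        return false; // external and not exempted
    }
}
```

With a single rule such as new Rule("\\.(jpe?g|png|gif)$", "image/"), an image 
on an external CDN host would be accepted while an external HTML page would 
still be dropped. The sketch compares full host names; a real implementation 
might compare by domain instead.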


Concrete use case:
  When using Nutch to crawl images from a set of domains, the crawler needs to 
fetch all images, including those linked from CDNs and other external domains. 
In this scenario, allowing all external links and then writing hundreds of 
regular expressions to filter them is not feasible for a large number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
