[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

Markus Jelsma (JIRA) Mon, 19 Oct 2015 14:41:42 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964110#comment-14964110 ]


Markus Jelsma commented on NUTCH-2144:
--------------------------------------

Yes, this is much more readable indeed.

In ExemptionUrlFilter there is a TODO for getting a content type in the Nutch 
way. It looks like you just want to get a content type for a given URL. Since 
you are using a built-in httpclient to do a HEAD request, and want to do it via 
the fetcher, this means you are going to do many additional requests. This is 
bad; we need to find a way to get the content type for any URL via the 
CrawlDatum.
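
To make the cost concrete, here is a minimal sketch of what a per-URL HEAD 
lookup amounts to. It is not the code from the patch: the class and method 
names are hypothetical, and it uses plain java.net instead of the httpclient 
mentioned above. Every call to fetchMediaType is one extra round trip to the 
remote host.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Nutch or the patch.
class HeadContentType {

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the media type. */
    static String mediaType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return null;
        }
        int semi = contentTypeHeader.indexOf(';');
        String type = semi >= 0 ? contentTypeHeader.substring(0, semi) : contentTypeHeader;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Issues one HEAD request per call and returns the bare media type, or null. */
    static String fetchMediaType(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // timeout values are arbitrary assumptions
            conn.setReadTimeout(5000);
            return mediaType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling this from a URL filter means one such request for every outlink that 
gets filtered, on top of the fetch itself.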

I had some thoughts about this earlier: URL filters are missing context 
completely, which we should fix some day anyway! But since this is about 
external items, it is much harder, because there is no information about 
them in the CrawlDB to begin with.

Would any of our other committers like to share some thoughts about these 
issues?

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Priority: Minor
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external.links). The 
> interesting URLs are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images, which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for a large number of domains.
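
The "combination of regex and mime-type rules" in the description above could 
look roughly like the sketch below. This is only an illustration under 
assumptions: the class names, the rule shape (one MIME-type regex paired with 
one URL regex) and the example patterns are hypothetical and are not taken 
from ignore-exempt.patch.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical rule set, for illustration only; not the plugin's actual code.
class ExemptionRules {

    /** One rule: the URL is exempted when both the MIME type and the URL match. */
    static final class Rule {
        final Pattern mimePattern;
        final Pattern urlPattern;

        Rule(String mimeRegex, String urlRegex) {
            this.mimePattern = Pattern.compile(mimeRegex);
            this.urlPattern = Pattern.compile(urlRegex);
        }

        boolean matches(String url, String mimeType) {
            // An unknown MIME type never matches, so the URL stays ignored.
            return mimeType != null
                    && mimePattern.matcher(mimeType).matches()
                    && urlPattern.matcher(url).matches();
        }
    }

    private final List<Rule> rules;

    ExemptionRules(List<Rule> rules) {
        this.rules = rules;
    }

    /** Returns true if any rule exempts this external URL from being ignored. */
    boolean isExempted(String url, String mimeType) {
        for (Rule rule : rules) {
            if (rule.matches(url, mimeType)) {
                return true;
            }
        }
        return false;
    }
}
```

With a single rule such as ("image/.*", ".*\\.(jpe?g|png|gif)"), an external 
CDN image URL reported as image/png would be exempted, while an external HTML 
page would still be dropped.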



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
