[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171503#comment-15171503 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/93

> Plugin to override db.ignore.external to exempt interesting external domain URLs
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler
> (db.ignore.external.links=true) to fetch static resources from external
> domains.
> The generalized version of this: This plugin should permit interesting URLs
> from external domains (by overriding db.ignore.external). The interesting
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
> When using Nutch to crawl images from a set of domains, the crawler needs
> to fetch all images which may be linked from CDNs and other domains. In this
> scenario, allowing all external links and then writing hundreds of regular
> expressions is not feasible for large number of domains.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171369#comment-15171369 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

Github user thammegowda closed the pull request at:

    https://github.com/apache/nutch/pull/89
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171366#comment-15171366 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

GitHub user thammegowda opened a pull request:

    https://github.com/apache/nutch/pull/93

    NUTCH-2144 Added an extension point and a plugin to accept external links

    This PR is a duplicate of #89, recreated due to the issues caused
    while moving to writable git.

    @chrismattmann

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/nutch NUTCH-2144

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #93

commit 2015703cfd32cae98b14d2fd6af5ac4396237c48
Author: Thamme Gowda
Date:   2016-02-29T03:23:26Z

    NUTCH-2144 Added an extension point and a plugin that overrides db.ignore.external to accept external links

commit 9a284c0d6d2aec86b00016a8abeddc07e5292ee9
Author: Thamme Gowda
Date:   2016-02-29T03:29:09Z

    Add a sample config
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167626#comment-15167626 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

GitHub user thammegowda reopened a pull request:

    https://github.com/apache/nutch/pull/89

    NUTCH-2144 : override db.ignore.external to exempt interesting external domain URLs

    + Add extension point org.apache.nutch.net.URLExemptionFilter
    + Modify FetcherThread and ParseOutputFormat to integrate new extension point
    + Add extension urlfilter-ignoreexempt
    + build configs modified to include new extension

    Resolves https://issues.apache.org/jira/browse/NUTCH-2144

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/nutch NUTCH-2144-ignore-exempt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/89.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #89

commit 29c7ae4ec088f0f428ab95992e10af9d87a231ad
Author: Thamme Gowda
Date:   2015-10-19T14:31:04Z

    Add an extension point and an extension to override 'db.ignore.external.links'.

    + Add extension point org.apache.nutch.net.URLExemptionFilter
    + Modify FetcherThread and ParseOutputFormat to integrate new extension point
    + Add extension urlfilter-ignoreexempt
    + build configs modified to include new extension

commit 43583f7af19c21dd7553c5e41c026cb852f0cfa1
Author: Thamme Gowda
Date:   2016-02-11T00:03:03Z

    Added ignore exemption extension point and an extension

commit 559eb4d905a081e34095f0e9d5ee3e805363ccc3
Author: Thamme Gowda
Date:   2016-02-11T00:21:08Z

    README updated

commit 3cf887befc56c3cf4127f45385e83c47248dd6e9
Author: Thamme Gowda
Date:   2016-02-11T00:23:12Z

    Add an example rule fiile

commit b5cf404bf451fa80186ebb4120cfd39aa2c0f00b
Author: Thamme Gowda
Date:   2016-02-12T01:01:02Z

    Added License header

commit 3a555b106a4cef9bf0c0e0699f79aedd14ef9fa1
Author: Thamme Gowda
Date:   2016-02-14T02:06:51Z

    Code reviewers suggestion incorporated
    + Reusing the rules and format from urlfilter-regex

commit 6bd026c8482f98b14a56b9b9bff78307f6998189
Author: Thamme Gowda
Date:   2016-02-25T02:16:17Z

    merge upstream changed and Resolve all conflicts
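The extension point described in this pull request can be pictured roughly as follows. This is only a minimal sketch, not the code from the patch: the `filter(fromUrl, toUrl)` shape is taken from the diff quoted elsewhere in this thread, while `ExemptionFilter`, `RegexExemptionFilter`, and `keepOrDrop` are hypothetical stand-ins so the example runs without Nutch on the classpath, and the host comparison is a simplification of what FetcherThread and ParseOutputFormat do.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Demo holder class; every type in this file is a local stand-in, not a Nutch class.
class ExemptionPointSketch {

  /** Stand-in for the extension point; only the method shape comes from the PR. */
  interface ExemptionFilter {
    boolean filter(String fromUrl, String toUrl);
  }

  /** Regex-backed filter: a URL is exempted when any rule matches it completely. */
  static class RegexExemptionFilter implements ExemptionFilter {
    private final List<Pattern> exemptions = new ArrayList<>();

    RegexExemptionFilter(List<String> rules) {
      for (String rule : rules) {
        exemptions.add(Pattern.compile(rule));
      }
    }

    @Override
    public boolean filter(String fromUrl, String toUrl) {
      // fromUrl is not used by this simple rule set
      for (Pattern p : exemptions) {
        if (p.matcher(toUrl).matches()) {
          return true;
        }
      }
      return false;
    }
  }

  /** Hypothetical caller: returns toUrl if the link is kept, or null if it is dropped. */
  static String keepOrDrop(boolean ignoreExternalLinks, ExemptionFilter filter,
                           String fromUrl, String toUrl) {
    String fromHost = URI.create(fromUrl).getHost();
    String toHost = URI.create(toUrl).getHost();
    boolean external = fromHost == null || !fromHost.equalsIgnoreCase(toHost);
    if (ignoreExternalLinks && external && !filter.filter(fromUrl, toUrl)) {
      return null; // external and not exempted
    }
    return toUrl;
  }

  public static void main(String[] args) {
    ExemptionFilter images = new RegexExemptionFilter(List.of(".*\\.(jpg|png|gif)$"));
    String from = "http://example.org/index.html";
    System.out.println(keepOrDrop(true, images, from, "http://cdn.example.net/logo.png"));
    System.out.println(keepOrDrop(true, images, from, "http://other.example.net/page.html"));
  }
}
```

The point of the design is that the exemption check only runs for links that would otherwise be dropped, so internal links and crawls with db.ignore.external.links=false are unaffected.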
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167600#comment-15167600 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/89
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166447#comment-15166447 ]

Thamme Gowda N commented on NUTCH-2144:
---------------------------------------

Hi [~wastl-nagel],
Were you able to test this plugin?
I agree on both points. The supplied plugin is just a start, and more sophisticated plugins can be built on this extension point.
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158684#comment-15158684 ]

Markus Jelsma commented on NUTCH-2144:
--------------------------------------

The signature of ParseOutputFormat.filterNormalize() has changed since NUTCH-2221: a parameter ignoreInternalLinks was added. The parameter is read from the db.ignore.internal.links configuration directive.
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146685#comment-15146685 ]

Sebastian Nagel commented on NUTCH-2144:
----------------------------------------

Hi [~thammegowda], thanks! Everything looks good with the changes. It's definitely a good idea to reuse the code from urlfilter-regex, and users will appreciate it if rules/regexes work the same way. The ant build files are ok, afaics, but I'll try to test the plugin tomorrow.

Two points I would like to bring up for discussion now, since this plugin will introduce a new interface, and interfaces aren't easily changed later:
# currently the filter(...) method takes fromUrl and toUrl as arguments. The interface could be more powerful and adaptable to further use cases if we add
## the tag name where the link comes from ("a", "img", "form", etc.). Currently the tag name is not available in ParseOutputFormat; we would have to pass it via Outlink from the parser, where tag names are already used to filter links, cf. property "parser.html.outlinks.ignore_tags". Tag names would be the easier way to distinguish between page resources and real outlinks.
## similarly, whether it's a link or a redirect: this could be used to follow redirects when a site has moved to a different host and is now redirected, while still ignoring external outlinks
# the naming could be more explicit: "URLExemptionFilter" or "urlfilter-ignoreexempt" do not make clear that it's about an exemption from the "db.ignore.external.links" property. Only the config file "conf/db-ignore-external-exemptions.txt" is sufficiently precise. To avoid overlong names (e.g., "IgnoreExternalLinksExemptionUrlFilter"), maybe resolve the double negation to something like "AcceptExternalUrlFilter" or "urlfilter-externallink".

As said, both points are just for discussion, or for later improvements.
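The first discussion point above (passing the tag name and a link/redirect distinction to the filter) could look something like the following. This is purely a hypothetical sketch of the idea: none of these names (`ContextAwareExemptionFilter`, `LinkKind`, `ResourceAndRedirectFilter`) exist in Nutch, and the choice of which tags count as page resources is an assumption made for illustration.

```java
import java.util.Locale;
import java.util.Set;

// Demo holder class; all names below are hypothetical and not part of Nutch.
class LinkContextSketch {

  /** Whether the target URL was found as an outlink or reached via a redirect. */
  enum LinkKind { OUTLINK, REDIRECT }

  /** Hypothetical richer variant of filter(fromUrl, toUrl) with link context. */
  interface ContextAwareExemptionFilter {
    boolean filter(String fromUrl, String toUrl, String tagName, LinkKind kind);
  }

  /** Exempts redirects and links coming from resource-like tags, nothing else. */
  static class ResourceAndRedirectFilter implements ContextAwareExemptionFilter {
    // Assumed set of tags that reference page resources rather than real outlinks.
    private static final Set<String> RESOURCE_TAGS = Set.of("img", "script", "link");

    @Override
    public boolean filter(String fromUrl, String toUrl, String tagName, LinkKind kind) {
      if (kind == LinkKind.REDIRECT) {
        return true; // follow a site that moved to another host
      }
      return tagName != null && RESOURCE_TAGS.contains(tagName.toLowerCase(Locale.ROOT));
    }
  }

  public static void main(String[] args) {
    ContextAwareExemptionFilter f = new ResourceAndRedirectFilter();
    String from = "http://example.org/";
    System.out.println(f.filter(from, "http://cdn.example.net/a.png", "img", LinkKind.OUTLINK));
    System.out.println(f.filter(from, "http://other.example.net/", "a", LinkKind.OUTLINK));
    System.out.println(f.filter(from, "http://www.example.com/", null, LinkKind.REDIRECT));
  }
}
```

With tag names available, a rule set would no longer need to guess from file extensions whether a URL is an embedded resource.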
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146181#comment-15146181 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/89#discussion_r52833038

--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ *
+ *
+ * Exempt urls ending with .jpg or .png or gif
+ * .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *
+ *
+
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
--- End diff --

Nutch automatically caches a single instance of each plugin class. The main() method could also call the constructor and then setConf().
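The reviewer's suggestion (drop the static getInstance() and let main() call the constructor followed by setConf()) can be sketched as below. This is an illustration under assumptions, not the revised patch: `Conf` is a tiny stand-in for Hadoop's Configuration so the example runs on its own, and `RuleFileFilter` and `create` are hypothetical names. Only the property key and the default file name are taken from the quoted Javadoc.

```java
import java.util.HashMap;
import java.util.Map;

// Demo holder class; Conf and RuleFileFilter are stand-ins, not Nutch or Hadoop classes.
class PlainInstantiationSketch {

  /** Stand-in for a configuration object: string properties with defaults. */
  static class Conf {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) {
      props.put(key, value);
    }

    String get(String key, String defaultValue) {
      return props.getOrDefault(key, defaultValue);
    }
  }

  /** Stand-in for the plugin class: no static INSTANCE, all state is set in setConf(). */
  static class RuleFileFilter {
    static final String RULES_FILE_KEY = "db.ignore.external.exemptions.file";
    static final String DEFAULT_RULES_FILE = "db-ignore-external-exemptions.txt";

    private Conf conf;
    private String rulesFile;

    void setConf(Conf conf) {
      this.conf = conf;
      this.rulesFile = conf.get(RULES_FILE_KEY, DEFAULT_RULES_FILE);
    }

    Conf getConf() {
      return conf;
    }

    String getRulesFile() {
      return rulesFile;
    }
  }

  /** What a main() used for manual rule testing could do instead of getInstance(). */
  static RuleFileFilter create(Conf conf) {
    RuleFileFilter filter = new RuleFileFilter();
    filter.setConf(conf);
    return filter;
  }

  public static void main(String[] args) {
    System.out.println(create(new Conf()).getRulesFile());
  }
}
```

Since the plugin framework already keeps one instance per plugin class, the hand-written double-checked locking adds nothing for normal crawls and is only needed, if at all, by the command-line test entry point.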
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146184#comment-15146184 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/89#discussion_r52833199

--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ *
+ *
+ * Exempt urls ending with .jpg or .png or gif
+ * .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *
+ *
+
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List exemptions;
+  private Configuration conf;
+  private boolean enabled;
--- End diff --

Is this variable necessary?
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146201#comment-15146201 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/89#discussion_r52833653

--- Diff: src/java/org/apache/nutch/parse/ParseOutputFormat.java ---
@@ -338,6 +340,9 @@ public static String filterNormalize(String fromUrl, String toUrl,
       } catch (MalformedURLException e1) {
         return null; // skip it
       }
+
+
+
       if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) {
         String toDomain = URLUtil.getDomainName(targetURL).toLowerCase();
         if (toDomain == null || !toDomain.equals(origin)) {
--- End diff --

Shouldn't this case also be covered (db.ignore.external.links == true and db.ignore.external.links.mode == byDomain)?
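The question above is whether the exemption is consulted in both modes of db.ignore.external.links.mode. A minimal sketch of a check that covers both branches follows. It is not the Nutch implementation: `filterLink` and `naiveDomain` are hypothetical names, the exemption is passed in as a plain `BiPredicate`, and `naiveDomain` simply keeps the last two host labels, which is a deliberate simplification (the real code uses URLUtil.getDomainName, which is aware of public suffixes).

```java
import java.net.URI;
import java.util.Locale;
import java.util.function.BiPredicate;

// Demo holder class; a simplified stand-in for the check in filterNormalize().
class IgnoreModeSketch {

  /** Naive domain extraction: last two labels of the host (assumption for the demo). */
  static String naiveDomain(String host) {
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  /** Returns toUrl if the link is kept, or null if it is dropped. */
  static String filterLink(String fromUrl, String toUrl, boolean ignoreExternal,
                           String mode, BiPredicate<String, String> exemption) {
    if (!ignoreExternal) {
      return toUrl;
    }
    String fromHost = URI.create(fromUrl).getHost();
    String toHost = URI.create(toUrl).getHost();
    if (fromHost == null || toHost == null) {
      return null; // skip URLs without a host
    }
    fromHost = fromHost.toLowerCase(Locale.ROOT);
    toHost = toHost.toLowerCase(Locale.ROOT);

    boolean external;
    if ("bydomain".equalsIgnoreCase(mode)) {
      external = !naiveDomain(fromHost).equals(naiveDomain(toHost));
    } else { // default: compare by host
      external = !fromHost.equals(toHost);
    }
    // The exemption is consulted in both modes, not only in the by-host branch.
    if (external && !exemption.test(fromUrl, toUrl)) {
      return null;
    }
    return toUrl;
  }

  public static void main(String[] args) {
    BiPredicate<String, String> png = (from, to) -> to.endsWith(".png");
    String from = "http://www.example.org/index.html";
    System.out.println(filterLink(from, "http://img.example.org/a.html", true, "byDomain", png));
    System.out.println(filterLink(from, "http://img.example.org/a.html", true, "byHost", png));
    System.out.println(filterLink(from, "http://cdn.example.net/a.png", true, "byDomain", png));
  }
}
```

Computing the "external" decision first and applying the exemption once afterwards avoids duplicating the exemption call in each branch.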
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146208#comment-15146208 ]

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/89#discussion_r52833853

--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ *
+ *
+ * Exempt urls ending with .jpg or .png or gif
+ * .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *
+ *
+
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
+    if(INSTANCE == null) {
+      synchronized (ExemptionUrlFilter.class) {
+        if (INSTANCE == null) {
+          INSTANCE = new ExemptionUrlFilter();
+          INSTANCE.setConf(NutchConfiguration.create());
+        }
+      }
+    }
+    return INSTANCE;
+  }
+
+  public boolean isEnabled() {
+    return enabled;
+  }
+
+  public List getExemptions() {
+    return exemptions;
+  }
+
+  @Override
+  public boolean filter(String fromUrl, String toUrl) {
+    //this implementation doesnt do anything with fromUrl
+    if (exemptions != null) {
+      for (Pattern pattern : exemptions) {
+        if (pattern.matcher(toUrl).matches()) {
--- End diff --

Could possibly use Pattern.find() instead of Pattern.matches():
- would make patterns often shorter: `\.jpg$` instead of `.*\.jpg` (or `.*\.jpg$`)
- (more important) the syntax would be the same as for urlfilter-regex rules
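The difference the reviewer points at can be checked with the standard java.util.regex API, where both methods live on Matcher: matches() requires the pattern to cover the whole input, while find() succeeds if any substring matches. The sketch below uses a made-up example URL and hypothetical helper names (`fullMatch`, `partialMatch`).

```java
import java.util.regex.Pattern;

// Demo holder class comparing Matcher.matches() with Matcher.find().
class FindVsMatchesSketch {

  /** True only if the regex matches the entire URL. */
  static boolean fullMatch(String regex, String url) {
    return Pattern.compile(regex).matcher(url).matches();
  }

  /** True if the regex matches anywhere in the URL. */
  static boolean partialMatch(String regex, String url) {
    return Pattern.compile(regex).matcher(url).find();
  }

  public static void main(String[] args) {
    String url = "http://cdn.example.net/images/photo.jpg";
    // The short rule needs find(); with matches() it has to be prefixed with ".*".
    System.out.println("find    \\.jpg$   : " + partialMatch("\\.jpg$", url));
    System.out.println("matches \\.jpg$   : " + fullMatch("\\.jpg$", url));
    System.out.println("matches .*\\.jpg  : " + fullMatch(".*\\.jpg", url));
  }
}
```

With find(), a rule without anchors also matches in the middle of a URL, so rules that are meant to apply to the end of the URL need an explicit `$`.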
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146209#comment-15146209 ]

ASF GitHub Bot commented on NUTCH-2144:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52833909

--- Diff: conf/db-ignore-external-exemptions.txt ---
@@ -0,0 +1,37 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+#
+# Exemption rules to db.ignore.external.links
+#
+# Format :
+#
+# UrlRegex1
+# UrlRegex2
+# UrlRegex3
+
+
+# NOTE ::
+# 1. When the url matches any of the regex then that url is exempted.
+# 2. # in the beginning makes it a comment line
+# 3. To Test the regex, update this file and use the below command
+# bin/nutch plugin urlfilter-ignoreexempt org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
+# 4. Dont forget to enable this plugin in nutch-site.xml
+
+
+# Example 1:
+#--
+# To exempt urls ending with image extensions, uncomment the below line
+#.*\.(jpg|JPG|png$|PNG|gif|GIF)$
--- End diff --

Regex could be simplified to `#(?i).*\.(?:jpg|png|gif)` (or `#(?i)\.(?:jpg|png|gif)$` if Pattern.find() is used). `(?i)` makes the pattern case insensitive, cf. [NUTCH-2035](https://issues.apache.org/jira/browse/NUTCH-2035)
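The rule-file conventions in the diff above (one regex per line, `#` starts a comment, a URL is exempted as soon as any rule matches) together with the suggested `(?i)` simplification can be sketched as follows. This is an illustrative stand-alone example, not the plugin's actual loader: `ExemptionRulesSketch`, `parseRules` and `isExempted` are hypothetical names, and the rules are passed in as strings rather than read from `conf/db-ignore-external-exemptions.txt`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of rule parsing and matching, for illustration only.
class ExemptionRulesSketch {

  // Compile one pattern per rule line; blank lines and '#' comment lines are skipped.
  static List<Pattern> parseRules(List<String> lines) {
    List<Pattern> rules = new ArrayList<>();
    for (String line : lines) {
      String rule = line.trim();
      if (rule.isEmpty() || rule.startsWith("#")) {
        continue;
      }
      rules.add(Pattern.compile(rule));
    }
    return rules;
  }

  // A URL is exempted as soon as any rule matches the whole URL.
  static boolean isExempted(List<Pattern> rules, String url) {
    for (Pattern rule : rules) {
      if (rule.matcher(url).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Pattern> rules = parseRules(Arrays.asList(
        "# image files, any letter case",
        "",
        "(?i).*\\.(?:jpg|png|gif)"));
    System.out.println(isExempted(rules, "http://cdn.example.com/a/B.PNG"));   // true
    System.out.println(isExempted(rules, "http://cdn.example.com/a/b.html"));  // false
  }
}
```

The embedded `(?i)` flag replaces the explicit `jpg|JPG|...` alternatives, so one lower-case spelling per extension is enough.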
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146210#comment-15146210 ]

ASF GitHub Bot commented on NUTCH-2144:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52833922

--- Diff: src/plugin/urlfilter-ignoreexempt/README.md ---
@@ -0,0 +1,52 @@
+urlfilter-ignoreexempt
+==
+ This plugin allows certain urls to be exempted when the external links are configured to be ignored.
+ This is useful when focused crawl is setup but some resources like static files are linked from CDNs (external domains).
+
+How to enable ?
+==
+Add `urlfilter-ignoreexempt` value to `plugin.includes` property
+```xml
+
+ plugin.includes
+ protocol-http|urlfilter-(regex|ignoreexempt)...
+
+```
+
+How to configure rules?
+
+
+open `conf/db-ignore-external-exemptions.txt` and add rules
+
+ Format :
+
+```
+UrlRegex1
+UrlRegex2
+UrlRegex3
+```
+
+
+ NOTE ::
+ 1. If an url matches any of the given regexps then that url is exempted.
+ 2. \# in the beginning makes it a comment line
+ 3. To Test the regex, update this file and use the below command
+bin/nutch plugin urlfilter-ignoreexempt org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
+
+
+ Example :
+
+ To exempt urls ending with image extensions, use this rule
+
+`.*\.(jpg|JPG|png$|PNG|gif|GIF)$# Testing`
--- End diff --

ditto
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146260#comment-15146260 ] ASF GitHub Bot commented on NUTCH-2144: --- Github user thammegowda commented on a diff in the pull request: https://github.com/apache/nutch/pull/89#discussion_r52835489 --- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java --- @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nutch.urlfilter.ignoreexempt; + +import org.apache.commons.io.IOUtils; +import org.apache.hadoop.conf.Configuration; +import org.apache.nutch.net.URLExemptionFilter; +import org.apache.nutch.util.NutchConfiguration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Arrays; +import java.util.regex.Pattern; +import java.util.List; +import java.util.ArrayList; + + +/** + * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration + * to check if URL is eligible for exemption from 'db.ignore.external'. 
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules. + * + * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be + * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml + * + * + * The exemption rules are specified in plain text file where each line is a rule. + * + * When the url matches regex it is exempted from 'db.ignore.external...' + * Examples: + * + * + * Exempt urls ending with .jpg or .png or gif + * .*\.(jpg|JPG|png$|PNG|gif|GIF)$ + * + * + + * + * @since Feb 10, 2016 + * @version 1 + * @see URLExemptionFilter + */ +public class ExemptionUrlFilter implements URLExemptionFilter { + + public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file"; + private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class); + private static ExemptionUrlFilter INSTANCE; + + private List exemptions; + private Configuration conf; + private boolean enabled; + + public static ExemptionUrlFilter getInstance() { +if(INSTANCE == null) { + synchronized (ExemptionUrlFilter.class) { +if (INSTANCE == null) { + INSTANCE = new ExemptionUrlFilter(); + INSTANCE.setConf(NutchConfiguration.create()); +} + } +} +return INSTANCE; + } + + public boolean isEnabled() { +return enabled; + } + + public List getExemptions() { +return exemptions; + } + + @Override + public boolean filter(String fromUrl, String toUrl) { +//this implementation doesnt do anything with fromUrl +if (exemptions != null) { + for (Pattern pattern : exemptions) { +if (pattern.matcher(toUrl).matches()) { + return true; //If a regex matches, then exempted +} + } +} +//not exempted +return false; + } + + @Override + public void setConf(Configuration conf) { +this.conf = conf; +LOG.info("Ignore exemptions enabled"); +String fileName = this.conf.get(DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE); +InputStream stream = 
this.conf.getConfResourceAsInputStream(fileName); +if (stream == null) { + throw new RuntimeException("Couldn't find config file :" + fileName); +} +try { + this.exemptions = new ArrayList(); + List lines = IOUtils.readLines(stream); + for (String line : lines) { +line = line.trim(); +if (line.startsWith("#") || line.isEmpty()) { +
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146262#comment-15146262 ]

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835493

--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ *
+ *
+ * Exempt urls ending with .jpg or .png or gif
+ * .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *
+ *
+
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
--- End diff --

:+1: Agreed.
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146263#comment-15146263 ]

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835517

--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ *
+ *
+ * Exempt urls ending with .jpg or .png or gif
+ * .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *
+ *
+
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
--- End diff --

Will be removed. Thanks for pointing it out.
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146271#comment-15146271 ]

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835613

--- Diff: src/java/org/apache/nutch/parse/ParseOutputFormat.java ---
@@ -338,6 +340,9 @@ public static String filterNormalize(String fromUrl, String toUrl,
       } catch (MalformedURLException e1) {
         return null; // skip it
       }
+
+
+
       if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) {
         String toDomain = URLUtil.getDomainName(targetURL).toLowerCase();
         if (toDomain == null || !toDomain.equals(origin)) {
--- End diff --

I did not have a chance to test the `db.ignore.external.links.mode : byDomain` feature. However, if the exemption override should also cover this new feature, I am ready to do so.
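One way to picture an exemption check that behaves the same in both the by-host and by-domain modes is to separate "how the origin is computed" from "what happens to external links". The sketch below is only an assumption about how such a check could be structured, not the code in `ParseOutputFormat`: every name in it is hypothetical, and `naiveDomainOf` simply takes the last two host labels, which is far cruder than a real domain lookup that knows about public suffixes.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Hypothetical sketch, for illustration only.
class ExternalLinkSketch {

  // Lower-cased host of a URL, or null if the URL cannot be parsed or has no host.
  static String hostOf(String url) {
    try {
      String host = new URI(url).getHost();
      return host == null ? null : host.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException e) {
      return null;
    }
  }

  // Naive "domain": the last two host labels. An assumption made for brevity only.
  static String naiveDomainOf(String url) {
    String host = hostOf(url);
    if (host == null) {
      return null;
    }
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  // Keep internal links; keep external links only if the exemption filter accepts them.
  static boolean keepOutlink(String fromUrl, String toUrl,
      Function<String, String> originOf, BiPredicate<String, String> exemption) {
    String from = originOf.apply(fromUrl);
    String to = originOf.apply(toUrl);
    boolean external = to == null || !to.equals(from);
    return !external || exemption.test(fromUrl, toUrl);
  }
}
```

Passing `ExternalLinkSketch::hostOf` or `ExternalLinkSketch::naiveDomainOf` as `originOf` switches the mode, while the exemption predicate is consulted in exactly one place for both.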
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146276#comment-15146276 ]

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835676

--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ *
+ *
+ * Exempt urls ending with .jpg or .png or gif
+ * .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *
+ *
+
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
+    if(INSTANCE == null) {
+      synchronized (ExemptionUrlFilter.class) {
+        if (INSTANCE == null) {
+          INSTANCE = new ExemptionUrlFilter();
+          INSTANCE.setConf(NutchConfiguration.create());
+        }
+      }
+    }
+    return INSTANCE;
+  }
+
+  public boolean isEnabled() {
+    return enabled;
+  }
+
+  public List<Pattern> getExemptions() {
+    return exemptions;
+  }
+
+  @Override
+  public boolean filter(String fromUrl, String toUrl) {
+    //this implementation doesnt do anything with fromUrl
+    if (exemptions != null) {
+      for (Pattern pattern : exemptions) {
+        if (pattern.matcher(toUrl).matches()) {
--- End diff --

Totally agreed :+1: My previous attempt to extend `org.apache.nutch.urlfilter.api.RegexURLFilterBase` failed due to my limited knowledge of the ant build. I will give it another try.
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15143793#comment-15143793 ]

ASF GitHub Bot commented on NUTCH-2144:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52692152

--- Diff: conf/db-ignore-external-exemptions.txt ---
@@ -0,0 +1,21 @@
+# Exemption rules to db.ignore.external.links
--- End diff --

License header please
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141181#comment-15141181 ]

Thamme Gowda N commented on NUTCH-2144:
---

+1 sounds great
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141172#comment-15141172 ]

Chris A. Mattmann commented on NUTCH-2144:
---

I am +1 for this patch, with the plugin enabled only by the user (and not by default). This is a critical patch for us in MEMEX and I think it adds a lot of value here to the community. [~lewismc] and I will work to get this committed in the next 48 hours. Thank you [~thammegowda]!
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141213#comment-15141213 ]

Lewis John McGibbney commented on NUTCH-2144:
---

Hi [~thammegowda], the limitations I see are as follows:
* As mentioned, the HEAD request is going to slow things down. I see your FIXME. I have a suggestion for the time being: let's initially address the case where we don't bother with HEAD and just rely upon mimeType detection through evaluation of the URL suffix. What do you think about this?
* I feel that the invocation of this entire plugin could be extended to also deal with db.ignore.internal. The exact same reasoning may apply to the use case where we wish to crawl images from a set of domains and the crawler needs to fetch all images which may be linked internally, but I have a list of, say, 5000 of these domains. In this scenario, allowing all internal links and then writing hundreds of regular expressions is not feasible for a large number of domains.

This is a nice patch and a lot of work. I like the extension point.
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141285#comment-15141285 ]
Thamme Gowda N commented on NUTCH-2144:
Hi [~lewismc],
* I think relying on URL-suffix-based MIME type detection is a reasonable trade-off of precision for speed. I did this for one of my homework assignments and, to be honest, I disabled HEAD-based MIME type detection because it was taking a lot of time. This patch uses basic Java regex to filter. [~chrismattmann] I am not sure if Tika can take a URL and guess a possible MIME type without making a HEAD call. Can you point me to an example, if there is one?
* Agreed. The same logic can be applied to permit certain intra-domain URLs.
I am glad you liked it. Let me know what improvements are needed to make this useful for a wide audience.
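The suffix-based detection discussed in this comment can be illustrated with a minimal sketch. This is not the code from the patch: the class name `SuffixMimeGuesser`, the method `guess`, and the small suffix table are assumptions made for illustration. The only point is that a MIME type can be guessed from the URL path alone, with no HEAD request.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: guesses a MIME type from the URL path suffix, without any network call. */
class SuffixMimeGuesser {

    // Illustrative table only; a real deployment would need a much fuller mapping.
    private static final Map<String, String> SUFFIX_TO_MIME = Map.of(
            "jpg", "image/jpeg",
            "jpeg", "image/jpeg",
            "png", "image/png",
            "gif", "image/gif");

    static Optional<String> guess(String url) {
        String path;
        try {
            path = URI.create(url).getPath(); // path only: query string and fragment are excluded
        } catch (IllegalArgumentException e) {
            return Optional.empty();          // malformed URL: no guess
        }
        if (path == null) {
            return Optional.empty();          // opaque URI such as mailto:
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // Require a dot inside the last path segment, followed by at least one character.
        if (dot < 0 || dot < slash || dot == path.length() - 1) {
            return Optional.empty();
        }
        String suffix = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(SUFFIX_TO_MIME.get(suffix));
    }

    public static void main(String[] args) {
        System.out.println(guess("http://cdn.example.com/img/logo.PNG?v=2"));
        System.out.println(guess("http://example.com/page.html"));
    }
}
```

The trade-off is the one named in the comment: a URL whose suffix does not reflect its real content type is misclassified, but no extra request is issued per URL.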
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141293#comment-15141293 ]
Chris A. Mattmann commented on NUTCH-2144:
Agreed and agreed. Thamme, can you submit a new version of the patch/pull request that, as Lewis suggests, uses just suffix checking?
Thamme, if you turn off MIME magic in the Tika config, it will default to glob pattern and URL regex matching. However, I wouldn't even bother with it in this case; a simple URL/regex check in Nutch will deliver the speed gains.
Looking forward to the new version of the PR.
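A "simple URL/regex check" of the kind suggested here might look like the sketch below. It uses only `java.util.regex`. The class name `ExternalUrlRegexCheck`, the method `isExempt`, and the single image pattern are hypothetical and are not taken from the patch.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: decides exemption purely by matching the URL against precompiled patterns. */
class ExternalUrlRegexCheck {

    // Patterns are compiled once; the single entry here is an illustrative assumption.
    private static final List<Pattern> EXEMPT_PATTERNS = List.of(
            Pattern.compile(".*\\.(jpg|jpeg|png|gif)$", Pattern.CASE_INSENSITIVE));

    /** Returns true when the whole URL matches at least one exemption pattern. */
    static boolean isExempt(String url) {
        for (Pattern p : EXEMPT_PATTERNS) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isExempt("http://cdn.example.com/a.JPG"));
        System.out.println(isExempt("http://example.com/index.html"));
    }
}
```

Because the pattern is anchored at the end of the full URL, a URL such as `http://cdn.example.com/a.png?v=2` would not match. Whether query strings should be stripped first is a design decision this sketch leaves open.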
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141359#comment-15141359 ]
Thamme Gowda N commented on NUTCH-2144:
Thanks. Yes, I will submit a new patch.
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141296#comment-15141296 ]
Lewis John McGibbney commented on NUTCH-2144:
bq. [~chrismattmann] I am not sure if Tika can take an URL and guess possible mime type without making a HEAD call. Can you point me to an example, if there is one?
Yes, here you go:
https://tika.apache.org/1.11/api/index.html?org/apache/tika/detect/NameDetector.html
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15142021#comment-15142021 ]
ASF GitHub Bot commented on NUTCH-2144:
GitHub user thammegowda opened a pull request:
    https://github.com/apache/nutch/pull/89
    NUTCH-2144 : override db.ignore.external to exempt interesting external domain URLs
    + Add extension point org.apache.nutch.net.URLExemptionFilter
    + Modify FetcherThread and ParseOutputFormat to integrate new extension point
    + Add extension urlfilter-ignoreexempt
    + build configs modified to include new extension
    Resolves https://issues.apache.org/jira/browse/NUTCH-2144
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/thammegowda/nutch NUTCH-2144-ignore-exempt
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/nutch/pull/89.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #89
commit 29c7ae4ec088f0f428ab95992e10af9d87a231ad
Author: Thamme Gowda
Date: 2015-10-19T14:31:04Z
    Add an extension point and an extension to override 'db.ignore.external.links'.
    + Add extension point org.apache.nutch.net.URLExemptionFilter
    + Modify FetcherThread and ParseOutputFormat to integrate new extension point
    + Add extension urlfilter-ignoreexempt
    + build configs modified to include new extension
commit 43583f7af19c21dd7553c5e41c026cb852f0cfa1
Author: Thamme Gowda
Date: 2016-02-11T00:03:03Z
    Added ignore exemption extension point and an extension
commit 559eb4d905a081e34095f0e9d5ee3e805363ccc3
Author: Thamme Gowda
Date: 2016-02-11T00:21:08Z
    README updated
commit 3cf887befc56c3cf4127f45385e83c47248dd6e9
Author: Thamme Gowda
Date: 2016-02-11T00:23:12Z
    Add an example rule file
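The pull request describes an extension point that is consulted when db.ignore.external.links would otherwise drop an outlink. The following sketch shows that decision flow in isolation. It is an assumption-laden illustration, not the Nutch code: `ExemptionFilter` is a hypothetical stand-in for the real extension point, `OutlinkPolicy` and `shouldFollow` are invented names, and comparing hosts by exact string equality is a simplification.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical stand-in for the extension point; the real interface is defined in the pull request. */
interface ExemptionFilter {
    boolean isExempted(String fromUrl, String toUrl);
}

/** Hypothetical sketch of how an exemption filter could override the external-link rule. */
class OutlinkPolicy {
    private final boolean ignoreExternalLinks; // mirrors db.ignore.external.links
    private final ExemptionFilter exemption;   // may be null when no exemption plugin is active

    OutlinkPolicy(boolean ignoreExternalLinks, ExemptionFilter exemption) {
        this.ignoreExternalLinks = ignoreExternalLinks;
        this.exemption = exemption;
    }

    boolean shouldFollow(String fromUrl, String toUrl) {
        if (!ignoreExternalLinks) {
            return true;                       // unfocused crawl: follow everything
        }
        String fromHost = host(fromUrl);
        String toHost = host(toUrl);
        if (fromHost != null && fromHost.equals(toHost)) {
            return true;                       // internal link: always allowed
        }
        // External link: allowed only if an exemption filter explicitly permits it.
        return exemption != null && exemption.isExempted(fromUrl, toUrl);
    }

    private static String host(String url) {
        try {
            String h = URI.create(url).getHost();
            return h == null ? null : h.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;                       // malformed URL: treat host as unknown
        }
    }

    public static void main(String[] args) {
        OutlinkPolicy focused = new OutlinkPolicy(true, (from, to) -> to.endsWith(".png"));
        System.out.println(focused.shouldFollow("http://a.example.com/x", "http://cdn.example.net/i.png"));
        System.out.println(focused.shouldFollow("http://a.example.com/x", "http://cdn.example.net/p.html"));
    }
}
```

Keeping the exemption decision behind an interface means the fetcher and parser only need to know that a filter exists; the rule logic (regex, MIME type, or anything else) stays inside the plugin.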
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963082#comment-14963082 ]
Markus Jelsma commented on NUTCH-2144:
Hi, I like the purpose of this plugin. The patch, however, is hardly readable; it contains various diffs for the same files. Can you provide a clean patch?
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964110#comment-14964110 ]
Markus Jelsma commented on NUTCH-2144:
Yes, this is much more readable indeed. In ExemptionUrlFilter there is a TODO for getting a content type in the Nutch way. It looks like you just want to get a content type for a given URL. Since you are using a built-in httpclient to do a HEAD request, and want to do it via the fetcher, this means you are going to make many additional requests. This is bad; we need to find a way to get the content type for any URL via the CrawlDatum.
I had some thoughts about this earlier: URL filters are missing context completely, which we should fix some day anyway! But since this is about external items, it is much harder, because there is no information about them in the CrawlDB to begin with.
Would any of our other committers like to share some thoughts about these issues?
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964175#comment-14964175 ]
Thamme Gowda N commented on NUTCH-2144:
Thanks for your feedback. I agree that the content type will be missing for newly discovered URLs, and I am quite sure that HTTP HEAD requests add a lot of overhead.
First, I apply a regex filter to limit the URLs, so the HEAD call is made only on the URLs matched by the regex. However, that was still not sufficient, so the content type filter is optional. For example, the rule
{color:red} image/jpeg,image/png,image/gif=.*\.(jpg|JPG|png$|PNG|gif|GIF)$ {color}
makes an HTTP HEAD call to the URLs matched by the regex to determine the content type. However, the rule
{color:red} =.*\.(jpg|JPG|png$|PNG|gif|GIF)$ {color}
does not make an HTTP HEAD call, because the content type filter is not applied.
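The two rules quoted in this comment suggest a `<mime-types>=<regex>` line format in which an empty left-hand side disables the content type check. A minimal parser for that format could look like the sketch below. The class `ExemptionRule` and its methods are hypothetical, splitting at the first `=` is an assumption about the format, and the patch itself may parse rules differently.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch of one rule line of the form "<mime-types>=<regex>". */
class ExemptionRule {
    final Set<String> mimeTypes; // empty set means: skip the content type check
    final Pattern urlPattern;

    private ExemptionRule(Set<String> mimeTypes, Pattern urlPattern) {
        this.mimeTypes = mimeTypes;
        this.urlPattern = urlPattern;
    }

    /** Splits at the first '=': left side is a comma-separated MIME list, right side is the regex. */
    static ExemptionRule parse(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("Expected <mime-types>=<regex>: " + line);
        }
        Set<String> mimes = new LinkedHashSet<>();
        for (String m : line.substring(0, eq).split(",")) {
            String trimmed = m.trim();
            if (!trimmed.isEmpty()) {
                mimes.add(trimmed.toLowerCase(Locale.ROOT));
            }
        }
        return new ExemptionRule(mimes, Pattern.compile(line.substring(eq + 1).trim()));
    }

    /** The cheap check that always runs first. */
    boolean matchesUrl(String url) {
        return urlPattern.matcher(url).matches();
    }

    /** Only rules that list MIME types would need the expensive HEAD request. */
    boolean needsContentTypeCheck() {
        return !mimeTypes.isEmpty();
    }

    public static void main(String[] args) {
        ExemptionRule withMime = parse("image/jpeg,image/png,image/gif=.*\\.(jpg|JPG|png$|PNG|gif|GIF)$");
        ExemptionRule regexOnly = parse("=.*\\.(jpg|JPG|png$|PNG|gif|GIF)$");
        System.out.println(withMime.needsContentTypeCheck());
        System.out.println(regexOnly.needsContentTypeCheck());
    }
}
```

With this shape, a caller would run `matchesUrl` first and issue a HEAD request only when `needsContentTypeCheck()` is true, which matches the ordering described in the comment.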