[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171503#comment-15171503
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/93


> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
>                 Key: NUTCH-2144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> More generally, this plugin should permit interesting URLs from external 
> domains by overriding db.ignore.external.links. Which URLs are interesting 
> is decided by a combination of regex and MIME-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images, which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for a large number of domains.
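
A minimal sketch of the regex side of such a rule-based exemption, for illustration only. The class and method names (ExemptionRulesSketch, compile, isExempted) are hypothetical and are not taken from the patch; the treatment of blank lines and '#' comments is an assumption, and MIME-type rules are not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration; not the plugin's actual code. */
class ExemptionRulesSketch {

  /** Compiles one regex per line, skipping blank lines and '#' comments (assumed format). */
  static List<Pattern> compile(List<String> lines) {
    List<Pattern> patterns = new ArrayList<>();
    for (String line : lines) {
      String rule = line.trim();
      if (rule.isEmpty() || rule.startsWith("#")) {
        continue;
      }
      patterns.add(Pattern.compile(rule));
    }
    return patterns;
  }

  /** Returns true if the external URL matches at least one rule in full. */
  static boolean isExempted(List<Pattern> rules, String toUrl) {
    for (Pattern rule : rules) {
      if (rule.matcher(toUrl).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Pattern> rules = compile(Arrays.asList(
        "# images hosted on any domain",
        ".*\\.(jpg|JPG|png|PNG|gif|GIF)$"));
    // An image on a CDN is exempted; an ordinary external page is not.
    System.out.println(isExempted(rules, "http://cdn.example.org/logo.png"));    // true
    System.out.println(isExempted(rules, "http://other.example.org/page.html")); // false
  }
}
```

With a single image rule like this, a crawl restricted to its seed domains can still pick up images served from arbitrary CDNs without listing each CDN host.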



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171369#comment-15171369
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda closed the pull request at:

https://github.com/apache/nutch/pull/89




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171366#comment-15171366
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

GitHub user thammegowda opened a pull request:

https://github.com/apache/nutch/pull/93

NUTCH-2144 Added an extension point and a plugin to accept external links

This PR is a duplicate of #89, recreated because of issues caused by the 
move to writable git.


@chrismattmann 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/nutch NUTCH-2144

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #93


commit 2015703cfd32cae98b14d2fd6af5ac4396237c48
Author: Thamme Gowda 
Date:   2016-02-29T03:23:26Z

NUTCH-2144 Added an extension point and a plugin that overrides 
db.ignore.external to accept external links

commit 9a284c0d6d2aec86b00016a8abeddc07e5292ee9
Author: Thamme Gowda 
Date:   2016-02-29T03:29:09Z

Add a sample config






[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167626#comment-15167626
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

GitHub user thammegowda reopened a pull request:

https://github.com/apache/nutch/pull/89

NUTCH-2144 : override db.ignore.external to exempt interesting external 
domain URLs

 + Add extension point
  org.apache.nutch.net.URLExemptionFilter
 + Modify FetcherThread and ParseOutputFormat to
  integrate new extension point
 + Add extension urlfilter-ignoreexempt
 + build configs modified to include new extension


Resolves https://issues.apache.org/jira/browse/NUTCH-2144

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/nutch NUTCH-2144-ignore-exempt

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/89.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #89


commit 29c7ae4ec088f0f428ab95992e10af9d87a231ad
Author: Thamme Gowda 
Date:   2015-10-19T14:31:04Z

Add an extension point and an extension to
 override 'db.ignore.external.links'.

 + Add extension point
  org.apache.nutch.net.URLExemptionFilter
 + Modify FetcherThread and ParseOutputFormat to
  integrate new extension point
 + Add extension urlfilter-ignoreexempt
 + build configs modified to include new extension

commit 43583f7af19c21dd7553c5e41c026cb852f0cfa1
Author: Thamme Gowda 
Date:   2016-02-11T00:03:03Z

Added ignore exemption extension point and an extension

commit 559eb4d905a081e34095f0e9d5ee3e805363ccc3
Author: Thamme Gowda 
Date:   2016-02-11T00:21:08Z

README updated

commit 3cf887befc56c3cf4127f45385e83c47248dd6e9
Author: Thamme Gowda 
Date:   2016-02-11T00:23:12Z

Add an example rule file

commit b5cf404bf451fa80186ebb4120cfd39aa2c0f00b
Author: Thamme Gowda 
Date:   2016-02-12T01:01:02Z

Added License header

commit 3a555b106a4cef9bf0c0e0699f79aedd14ef9fa1
Author: Thamme Gowda 
Date:   2016-02-14T02:06:51Z

Code reviewers' suggestions incorporated

+ Reusing the rules and format from urlfilter-regex

commit 6bd026c8482f98b14a56b9b9bff78307f6998189
Author: Thamme Gowda 
Date:   2016-02-25T02:16:17Z

Merge upstream changes and resolve all conflicts






[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167600#comment-15167600
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/89




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-24 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166447#comment-15166447
 ] 

Thamme Gowda N commented on NUTCH-2144:
---

Hi [~wastl-nagel],
Were you able to test this plugin?

I agree with both points.
The supplied plugin is just a start; more sophisticated plugins can be built 
on this extension point. 



[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-23 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158684#comment-15158684
 ] 

Markus Jelsma commented on NUTCH-2144:
--

The signature of ParseOutputFormat.filterNormalize() has changed since 
NUTCH-2221: a parameter ignoreInternalLinks was added. Its value is read from 
the db.ignore.internal.links configuration property.



[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-14 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146685#comment-15146685
 ] 

Sebastian Nagel commented on NUTCH-2144:


Hi [~thammegowda],
thanks! Everything looks good with the changes. It's definitely a good idea to 
reuse the code from urlfilter-regex, and users will appreciate it if 
rules/regexes work the same way. The ant build files are OK as far as I can 
see, but I'll try to test the plugin tomorrow.

Two points I would like to bring up for discussion now, since this plugin will 
introduce a new interface, and interfaces aren't easily changed later:
# Currently the filter(...) method takes fromUrl and toUrl as arguments. The 
interface could be more powerful and adaptable to further use cases if we add
## the tag name where the link comes from ("a", "img", "form", etc.). Currently 
the tag name is not available in ParseOutputFormat; we would have to pass it 
via Outlink from the parser, where tag names are already used to filter links, 
cf. property "parser.html.outlinks.ignore_tags". Tag names would be the easier 
way to distinguish between page resources and real outlinks.
## similarly, whether it's a link or a redirect: this could be used to follow 
redirects when a site has moved to a different host, while still ignoring 
external outlinks.
# The naming could be more explicit: "URLExemptionFilter" and 
"urlfilter-ignoreexempt" do not make clear that it's about an exemption from 
the "db.ignore.external.links" property. Only the config file name 
"conf/db-ignore-external-exemptions.txt" is sufficiently precise. To avoid 
overlong names (e.g., "IgnoreExternalLinksExemptionUrlFilter"), maybe resolve 
the double negation to something like "AcceptExternalUrlFilter" or 
"urlfilter-externallink".

As said, both points are just for discussion, or for later improvements.
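
To make the first point concrete, here is a rough sketch of what a richer filter signature might look like. Everything in it is hypothetical: the interface name LinkContextExemptionFilter, the extra tagName and isRedirect parameters, and the sample implementation are assumptions for discussion, not part of the patch, which only defines filter(fromUrl, toUrl).

```java
/** Hypothetical richer variant of the two-argument filter(fromUrl, toUrl). */
interface LinkContextExemptionFilter {
  /**
   * @param tagName    tag the link was found in ("a", "img", ...); assumed to be passed by the parser
   * @param isRedirect true if toUrl is a redirect target rather than an outlink
   * @return true if toUrl is exempted from being ignored as an external link
   */
  boolean filter(String fromUrl, String toUrl, String tagName, boolean isRedirect);
}

/** Sample rule: exempt page resources (img) and redirects, keep ignoring ordinary outlinks. */
class ResourceAndRedirectExemption implements LinkContextExemptionFilter {
  @Override
  public boolean filter(String fromUrl, String toUrl, String tagName, boolean isRedirect) {
    if (isRedirect) {
      return true; // follow a site that has moved to another host
    }
    return "img".equalsIgnoreCase(tagName);
  }

  public static void main(String[] args) {
    LinkContextExemptionFilter f = new ResourceAndRedirectExemption();
    System.out.println(f.filter("http://a.example/", "http://cdn.example/x.png", "img", false)); // true
    System.out.println(f.filter("http://a.example/", "http://b.example/", "a", false));          // false
    System.out.println(f.filter("http://a.example/", "http://new.example/", "a", true));         // true
  }
}
```

With tag names available, a rule like this needs no regex at all to separate page resources from real outlinks.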



[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146181#comment-15146181
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52833038
  
--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ * 
+ *   
+ * Exempt urls ending with .jpg or .png or gif
+ *  .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *   
+ * 
+ 
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
--- End diff --

Nutch automatically caches a single instance of each plugin class. The 
main() method could also call the constructor and then setConf().
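
A small sketch of that suggestion, under stated assumptions: java.util.Properties stands in for the Hadoop Configuration so the example runs on its own, and the class name ConfigurableFilterSketch is hypothetical. The property name and the default file name are the ones given in the Javadoc above.

```java
import java.util.Properties;

/** Hypothetical stand-in showing "constructor, then setConf()" without a singleton. */
class ConfigurableFilterSketch {

  static final String RULES_FILE_KEY = "db.ignore.external.exemptions.file";
  static final String DEFAULT_RULES_FILE = "db-ignore-external-exemptions.txt";

  private Properties conf;
  private String rulesFile;

  /** Properties replaces Hadoop's Configuration here purely for illustration. */
  void setConf(Properties conf) {
    this.conf = conf;
    this.rulesFile = conf.getProperty(RULES_FILE_KEY, DEFAULT_RULES_FILE);
  }

  String getRulesFile() {
    return rulesFile;
  }

  public static void main(String[] args) {
    // No getInstance(): build the object directly and configure it.
    ConfigurableFilterSketch filter = new ConfigurableFilterSketch();
    filter.setConf(new Properties());
    System.out.println(filter.getRulesFile()); // db-ignore-external-exemptions.txt
  }
}
```

Dropping the static INSTANCE field this way leaves instance caching to the plugin framework.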




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146184#comment-15146184
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52833199
  
--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ * 
+ *   
+ * Exempt urls ending with .jpg or .png or gif
+ *  .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *   
+ * 
+ 
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
--- End diff --

Is this variable necessary?




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146201#comment-15146201
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52833653
  
--- Diff: src/java/org/apache/nutch/parse/ParseOutputFormat.java ---
@@ -338,6 +340,9 @@ public static String filterNormalize(String fromUrl, String toUrl,
       } catch (MalformedURLException e1) {
         return null; // skip it
       }
+
+
+
       if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) {
         String toDomain = URLUtil.getDomainName(targetURL).toLowerCase();
         if (toDomain == null || !toDomain.equals(origin)) {
--- End diff --

Shouldn't this case also be covered (db.ignore.external.links == true and 
db.ignore.external.links.mode == byDomain)?
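
A rough, self-contained sketch of a decision that consults the exemption in both modes. It is not the ParseOutputFormat code: the class and method names are hypothetical, naiveDomain() is a crude stand-in for a proper domain lookup such as URLUtil.getDomainName(), and treating every mode other than "bydomain" as a host comparison is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.function.BiPredicate;

/** Hypothetical illustration of applying an exemption in host and domain mode alike. */
class ExternalLinkDecisionSketch {

  /** Lower-cased host of the URL, or null if it cannot be parsed. */
  static String hostOf(String url) {
    try {
      String host = new URI(url).getHost();
      return host == null ? null : host.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException e) {
      return null;
    }
  }

  /** Crude stand-in for a real domain lookup: keeps the last two host labels. */
  static String naiveDomain(String host) {
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  /** Returns toUrl if the link is kept, or null if it is ignored. */
  static String keepOrIgnore(String fromUrl, String toUrl, boolean ignoreExternal,
      String mode, BiPredicate<String, String> exemption) {
    if (!ignoreExternal) {
      return toUrl;
    }
    String fromHost = hostOf(fromUrl);
    String toHost = hostOf(toUrl);
    if (fromHost == null || toHost == null) {
      return null; // skip unparsable URLs
    }
    boolean external = "bydomain".equalsIgnoreCase(mode)
        ? !naiveDomain(toHost).equals(naiveDomain(fromHost))
        : !toHost.equals(fromHost);
    // The exemption is checked once, after the mode-specific comparison.
    if (external && !exemption.test(fromUrl, toUrl)) {
      return null;
    }
    return toUrl;
  }
}
```

Placing the exemption check after the mode-specific comparison means one call site serves both modes, which is the coverage the question asks about.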




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146208#comment-15146208
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52833853
  
--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ * 
+ *   
+ * Exempt urls ending with .jpg or .png or gif
+ *  .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *   
+ * 
+ 
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = 
"db.ignore.external.exemptions.file";
+  private static final Logger LOG = 
LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
+if(INSTANCE == null) {
+  synchronized (ExemptionUrlFilter.class) {
+if (INSTANCE == null) {
+  INSTANCE = new ExemptionUrlFilter();
+  INSTANCE.setConf(NutchConfiguration.create());
+}
+  }
+}
+return INSTANCE;
+  }
+
+  public boolean isEnabled() {
+return enabled;
+  }
+
+  public List<Pattern> getExemptions() {
+return exemptions;
+  }
+
+  @Override
+  public boolean filter(String fromUrl, String toUrl) {
+//this implementation doesnt do anything with fromUrl
+if (exemptions != null) {
+  for (Pattern pattern : exemptions) {
+if (pattern.matcher(toUrl).matches()) {
--- End diff --

Could possibly use Matcher.find() instead of Matcher.matches():
- patterns would often be shorter: `\.jpg$` instead of `.*\.jpg` (or 
`.*\.jpg$`)
- (more important) the syntax would be the same as for urlfilter-regex rules
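
To make the difference concrete, here is a minimal sketch (not code from the pull request) comparing the two `java.util.regex.Matcher` calls on the same URL. The class name `FindVsMatches` and the sample URL are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Illustrative only: contrasts whole-string matching with substring search.
final class FindVsMatches {

  // matches(): the pattern must cover the entire URL.
  static boolean fullMatch(String regex, String url) {
    return Pattern.compile(regex).matcher(url).matches();
  }

  // find(): the pattern may match anywhere unless it is anchored.
  static boolean partialMatch(String regex, String url) {
    return Pattern.compile(regex).matcher(url).find();
  }

  public static void main(String[] args) {
    String url = "http://cdn.example.com/img/logo.jpg"; // hypothetical URL
    System.out.println(fullMatch(".*\\.jpg$", url));    // long form needed for matches()
    System.out.println(fullMatch("\\.jpg$", url));      // short form fails with matches()
    System.out.println(partialMatch("\\.jpg$", url));   // short form works with find()
  }
}
```

With `find()`, a rule only has to describe the interesting part of the URL, which is why the shorter `\.jpg$` form becomes sufficient.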


> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, 

[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146209#comment-15146209
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52833909
  
--- Diff: conf/db-ignore-external-exemptions.txt ---
@@ -0,0 +1,37 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+#
+# Exemption rules to db.ignore.external.links
+#
+# Format :
+#
+# UrlRegex1
+# UrlRegex2
+# UrlRegex3
+
+
+# NOTE ::
+# 1. When the url matches any of the regex then that url is exempted.
+# 2. # in the beginning makes it a comment line
+# 3. To Test the regex, update this file and use the below command
+#   bin/nutch plugin urlfilter-ignoreexempt org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter 
+# 4. Dont forget to enable this plugin in nutch-site.xml
+
+
+# Example 1:
+#--
+# To exempt urls ending with image extensions, uncomment the below line
+#.*\.(jpg|JPG|png$|PNG|gif|GIF)$
--- End diff --

Regex could be simplified to `#(?i).*\.(?:jpg|png|gif)` (or 
`#(?i)\.(?:jpg|png|gif)$` if Matcher.find() is used). `(?i)` makes the pattern 
case-insensitive, cf. 
[NUTCH-2035](https://issues.apache.org/jira/browse/NUTCH-2035)
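
As a rough illustration of the simplified rule (a sketch, not the plugin's code), the snippet below compiles the `find()`-style variant without the leading `#` comment marker and checks a few URLs. The class name `CaseInsensitiveRule` and the sample URLs are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: one case-insensitive rule replaces the jpg|JPG|png|PNG|... alternation.
final class CaseInsensitiveRule {

  // (?i) turns on case-insensitive matching for the whole pattern.
  static final Pattern IMAGE = Pattern.compile("(?i)\\.(?:jpg|png|gif)$");

  // Uses find(), so the rule only needs to describe the URL suffix.
  static boolean isImage(String url) {
    return IMAGE.matcher(url).find();
  }

  public static void main(String[] args) {
    System.out.println(isImage("http://cdn.example.com/a.JPG"));
    System.out.println(isImage("http://cdn.example.com/a.Png"));
    System.out.println(isImage("http://cdn.example.com/a.html"));
  }
}
```

Note that a suffix rule like this does not match URLs that carry a query string after the extension; whether that matters depends on the crawl.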



> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146210#comment-15146210
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52833922
  
--- Diff: src/plugin/urlfilter-ignoreexempt/README.md ---
@@ -0,0 +1,52 @@
+urlfilter-ignoreexempt
+==
+  This plugin allows certain urls to be exempted when the external links are configured to be ignored.
+  This is useful when focused crawl is setup but some resources like static files are linked from CDNs (external domains).
+
+How to enable ?
+==
+Add `urlfilter-ignoreexempt` value to `plugin.includes` property
+```xml
+<property>
+  <name>plugin.includes</name>
+  <value>protocol-http|urlfilter-(regex|ignoreexempt)...</value>
+</property>
+```
+
+How to configure rules?
+
+
+open `conf/db-ignore-external-exemptions.txt` and add rules
+
+ Format :
+
+```
+UrlRegex1
+UrlRegex2
+UrlRegex3
+```
+
+
+ NOTE ::
+ 1. If an url matches any of the given regexps then that url is exempted.
+ 2. \# in the beginning makes it a comment line
+ 3. To Test the regex, update this file and use the below command
+bin/nutch plugin urlfilter-ignoreexempt org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter 
+
+
+ Example :
+
+ To exempt urls ending with image extensions, use this rule
+
+`.*\.(jpg|JPG|png$|PNG|gif|GIF)$# Testing`
--- End diff --

Ditto.
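
For readers following the rule-file format described in the README diff above (one regex per line, `#` starting a comment line, a URL exempted when any rule matches), here is a minimal sketch of how such a file could be interpreted. It is an assumption-laden illustration, not the plugin's implementation: the class `ExemptionRules` and its method names are hypothetical, and it uses whole-string `matches()` as in the diff under review.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: parses exemption rules and applies them to a URL.
final class ExemptionRules {

  // Compiles one pattern per line, skipping blank lines and '#' comment lines.
  static List<Pattern> parse(List<String> lines) {
    List<Pattern> rules = new ArrayList<>();
    for (String line : lines) {
      String rule = line.trim();
      if (rule.isEmpty() || rule.startsWith("#")) {
        continue;
      }
      rules.add(Pattern.compile(rule));
    }
    return rules;
  }

  // A URL is exempted as soon as any rule matches the whole URL.
  static boolean isExempted(List<Pattern> rules, String url) {
    for (Pattern rule : rules) {
      if (rule.matcher(url).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Pattern> rules = parse(Arrays.asList(
        "# exempt common image types",
        "",
        ".*\\.(jpg|png|gif)$"));
    System.out.println(isExempted(rules, "http://cdn.example.com/a.png"));
    System.out.println(isExempted(rules, "http://cdn.example.com/a.html"));
  }
}
```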




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146260#comment-15146260
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835489
  
--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ * 
+ *   
+ * Exempt urls ending with .jpg or .png or gif
+ *  .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *   
+ * 
+ 
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
+if(INSTANCE == null) {
+  synchronized (ExemptionUrlFilter.class) {
+if (INSTANCE == null) {
+  INSTANCE = new ExemptionUrlFilter();
+  INSTANCE.setConf(NutchConfiguration.create());
+}
+  }
+}
+return INSTANCE;
+  }
+
+  public boolean isEnabled() {
+return enabled;
+  }
+
+  public List<Pattern> getExemptions() {
+return exemptions;
+  }
+
+  @Override
+  public boolean filter(String fromUrl, String toUrl) {
+//this implementation doesnt do anything with fromUrl
+if (exemptions != null) {
+  for (Pattern pattern : exemptions) {
+if (pattern.matcher(toUrl).matches()) {
+  return true; //If a regex matches, then exempted
+}
+  }
+}
+//not exempted
+return false;
+  }
+
+  @Override
+  public void setConf(Configuration conf) {
+this.conf = conf;
+LOG.info("Ignore exemptions enabled");
+String fileName = this.conf.get(DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE);
+InputStream stream = this.conf.getConfResourceAsInputStream(fileName);
+if (stream == null) {
+  throw new RuntimeException("Couldn't find config file :" + fileName);
+}
+try {
+  this.exemptions = new ArrayList();
+  List<String> lines = IOUtils.readLines(stream);
+  for (String line : lines) {
+line = line.trim();
+if (line.startsWith("#") || line.isEmpty()) {
+  

[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146262#comment-15146262
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835493
  
--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ * 
+ *   
+ * Exempt urls ending with .jpg or .png or gif
+ *  .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *   
+ * 
+ 
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
--- End diff --

:+1: 

Agreed.




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146263#comment-15146263
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835517
  
--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ * 
+ *   
+ * Exempt urls ending with .jpg or .png or gif
+ *  .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *   
+ * 
+ 
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
--- End diff --

Will be removed.
Thanks for pointing it out.
 




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146271#comment-15146271
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835613
  
--- Diff: src/java/org/apache/nutch/parse/ParseOutputFormat.java ---
@@ -338,6 +340,9 @@ public static String filterNormalize(String fromUrl, String toUrl,
   } catch (MalformedURLException e1) {
 return null; // skip it
   }
+
+
+
   if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) {
 String toDomain = URLUtil.getDomainName(targetURL).toLowerCase();
 if (toDomain == null || !toDomain.equals(origin)) {
--- End diff --

I have not had a chance to test the `db.ignore.external.links.mode: byDomain` 
feature.

However, if exemptions should also be able to override this new mode, I 
am ready to add that.
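
One way to picture covering both modes (a sketch under stated assumptions, not the code in `ParseOutputFormat`) is to decide first whether a link is internal, by host or by domain, and only then consult the exemption check. Everything here is hypothetical: the class `ExternalLinkCheck`, the `BiPredicate` standing in for the exemption filter, and especially `naiveDomain`, which simply keeps the last two host labels and is no substitute for a real domain lookup.

```java
import java.net.URI;
import java.util.function.BiPredicate;

// Illustrative only: applies an exemption check after a by-host or by-domain comparison.
final class ExternalLinkCheck {

  // Lower-cased host of a URL, or an empty string if none can be parsed.
  static String host(String url) {
    String h = URI.create(url).getHost();
    return h == null ? "" : h.toLowerCase();
  }

  // Crude stand-in for a domain lookup: keeps the last two host labels.
  static String naiveDomain(String host) {
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  // Returns true when the outlink should be kept.
  static boolean keep(String fromUrl, String toUrl, boolean byDomain,
      BiPredicate<String, String> exemption) {
    String fromHost = host(fromUrl);
    String toHost = host(toUrl);
    boolean internal = byDomain
        ? naiveDomain(fromHost).equals(naiveDomain(toHost))
        : fromHost.equals(toHost);
    // The exemption is consulted only for links that would otherwise be dropped.
    return internal || exemption.test(fromUrl, toUrl);
  }

  public static void main(String[] args) {
    BiPredicate<String, String> pngOnly = (from, to) -> to.endsWith(".png");
    String page = "http://www.example.com/";
    System.out.println(keep(page, "http://img.example.com/a.html", true, pngOnly));
    System.out.println(keep(page, "http://img.example.com/a.html", false, pngOnly));
    System.out.println(keep(page, "http://cdn.other.net/a.png", true, pngOnly));
  }
}
```

Placing the exemption after the internal/external decision keeps the behaviour identical in both modes: the mode only changes what counts as external, not how exemptions are applied.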




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146276#comment-15146276
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52835676
  
--- Diff: src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.urlfilter.ignoreexempt;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.net.URLExemptionFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.regex.Pattern;
+import java.util.List;
+import java.util.ArrayList;
+
+
+/**
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
+ * to check if URL is eligible for exemption from 'db.ignore.external'.
+ * When this filter is enabled, urls will be checked against configured sequence of regex rules.
+ *
+ * The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
+ * overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
+ *
+ *
+ * The exemption rules are specified in plain text file where each line is a rule.
+ *
+ * When the url matches regex it is exempted from 'db.ignore.external...'
+ * Examples:
+ * 
+ *   
+ * Exempt urls ending with .jpg or .png or gif
+ *  .*\.(jpg|JPG|png$|PNG|gif|GIF)$
+ *   
+ * 
+ 
+ *
+ * @since Feb 10, 2016
+ * @version 1
+ * @see URLExemptionFilter
+ */
+public class ExemptionUrlFilter implements URLExemptionFilter {
+
+  public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE = "db.ignore.external.exemptions.file";
+  private static final Logger LOG = LoggerFactory.getLogger(ExemptionUrlFilter.class);
+  private static ExemptionUrlFilter INSTANCE;
+
+  private List<Pattern> exemptions;
+  private Configuration conf;
+  private boolean enabled;
+
+  public static ExemptionUrlFilter getInstance() {
+if(INSTANCE == null) {
+  synchronized (ExemptionUrlFilter.class) {
+if (INSTANCE == null) {
+  INSTANCE = new ExemptionUrlFilter();
+  INSTANCE.setConf(NutchConfiguration.create());
+}
+  }
+}
+return INSTANCE;
+  }
+
+  public boolean isEnabled() {
+return enabled;
+  }
+
+  public List<Pattern> getExemptions() {
+return exemptions;
+  }
+
+  @Override
+  public boolean filter(String fromUrl, String toUrl) {
+//this implementation doesnt do anything with fromUrl
+if (exemptions != null) {
+  for (Pattern pattern : exemptions) {
+if (pattern.matcher(toUrl).matches()) {
--- End diff --

Totally agreed :+1: 

My previous attempt to extend 
`org.apache.nutch.urlfilter.api.RegexURLFilterBase` 
failed due to my limited knowledge of the Ant build.

I will give it another try.



[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15143793#comment-15143793
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/89#discussion_r52692152
  
--- Diff: conf/db-ignore-external-exemptions.txt ---
@@ -0,0 +1,21 @@
+# Exemption rules to db.ignore.external.links
--- End diff --

License header please




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141181#comment-15141181
 ] 

Thamme Gowda N commented on NUTCH-2144:
---

+1 sounds great



[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141172#comment-15141172
 ] 

Chris A. Mattmann commented on NUTCH-2144:
--

I am +1 for this patch, as long as it is enabled only by the user (and not by 
default). This is a critical patch for us in MEMEX and I think it adds a lot of 
value here to the community.

[~lewismc] and I will work to get this committed in the next 48 hours. Thank 
you [~thammegowda]!

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Priority: Minor
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141213#comment-15141213
 ] 

Lewis John McGibbney commented on NUTCH-2144:
-

Hi [~thammegowda], the limitations I see are as follows:
 * As mentioned, the HEAD request is going to slow things down. I see your 
FIXME. I have a suggestion for the time being: let's initially address the case 
where we don't bother with HEAD and just rely upon mimeType detection through 
evaluation of the URL suffix. What do you think about this?
 * I feel that this plugin could be extended to also deal with 
db.ignore.internal. The same reasoning applies when we wish to crawl images 
from a set of domains and the crawler needs to fetch all images that are linked 
internally, but the list holds, say, 5000 of these domains. In that scenario, 
allowing all internal links and then writing hundreds of regular expressions is 
not feasible for a large number of domains.

This is a nice patch and a lot of work. I like the extension point.
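
The suffix-based detection suggested in the first point can be sketched 
roughly as follows. This is only an illustration, not code from the patch: the 
class name SuffixMimeGuesser, the method guess and the small suffix table are 
all assumptions made for the example.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: guesses a MIME type from the URL path suffix only, with no HTTP HEAD. */
final class SuffixMimeGuesser {

  // Illustrative mapping (an assumption); a real plugin would load this from configuration.
  private static final Map<String, String> SUFFIX_TO_MIME = Map.of(
      "jpg", "image/jpeg",
      "jpeg", "image/jpeg",
      "png", "image/png",
      "gif", "image/gif");

  /** Returns the guessed MIME type, or empty if the URL has no known suffix. */
  static Optional<String> guess(String url) {
    String path;
    try {
      path = URI.create(url).getPath(); // the path excludes query string and fragment
    } catch (IllegalArgumentException e) {
      return Optional.empty();          // unparsable URL: no guess
    }
    if (path == null) {
      return Optional.empty();          // opaque URI such as mailto:
    }
    int slash = path.lastIndexOf('/');
    int dot = path.lastIndexOf('.');
    // Require a dot inside the last path segment, followed by at least one character.
    if (dot < 0 || dot < slash || dot == path.length() - 1) {
      return Optional.empty();
    }
    String suffix = path.substring(dot + 1).toLowerCase(Locale.ROOT);
    return Optional.ofNullable(SUFFIX_TO_MIME.get(suffix));
  }

  public static void main(String[] args) {
    System.out.println(guess("http://cdn.example.com/img/logo.PNG?v=2"));
    System.out.println(guess("http://example.com/page.html"));
  }
}
```

The trade-off is the one discussed here: no network round trip per URL, at the 
cost of misclassifying resources whose suffix does not reflect their content.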

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141285#comment-15141285
 ] 

Thamme Gowda N commented on NUTCH-2144:
---

Hi [~lewismc]
* I think relying on URL suffix based mimetype detection is a reasonable 
trade-off of precision for speed. I did this for one of my homework assignments 
and, to be honest, I disabled HEAD based MIME type detection because it was 
taking a lot of time. This patch uses basic Java regex to filter. 
[~chrismattmann] I am not sure if Tika can take a URL and guess a possible MIME 
type without making a HEAD call. Can you point me to an example, if there is 
one?
* Agreed. The same logic can be applied to permit certain intra-domain URLs. 

I am glad you liked it. Let me know what improvements are needed to make this 
useful for a wide audience.




> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141293#comment-15141293
 ] 

Chris A. Mattmann commented on NUTCH-2144:
--

Agreed and agreed. Thamme, can you submit a new version of the patch/pull 
request, as Lewis suggests, using just suffix checking?

Thamme - if you turn off MIME magic in the Tika config, then it will default to 
glob pattern and URL regex matching. However, I wouldn't even bother with it in 
this case; just doing a simple URL/regex check in Nutch will deliver the speed 
gains.

Looking forward to the new version of the PR.

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141359#comment-15141359
 ] 

Thamme Gowda N commented on NUTCH-2144:
---

Thanks.

Yes, I will submit a new patch.

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141296#comment-15141296
 ] 

Lewis John McGibbney commented on NUTCH-2144:
-

bq.  [~chrismattmann] I am not sure if Tika can take an URL and guess possible 
mime type without making a HEAD call. Can you point me to an example, if there 
is one?

Yes here you go 
https://tika.apache.org/1.11/api/index.html?org/apache/tika/detect/NameDetector.html

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15142021#comment-15142021
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

GitHub user thammegowda opened a pull request:

https://github.com/apache/nutch/pull/89

NUTCH-2144 : override db.ignore.external to exempt interesting external 
domain URLs

 + Add extension point
  org.apache.nutch.net.URLExemptionFilter
 + Modify FetcherThread and ParseOutputFormat to
  integrate new extension point
 + Add extension urlfilter-ignoreexempt
 + build configs modified to include new extension
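
How such an extension point might interact with db.ignore.external.links can be 
sketched as below. This is a simplified stand-in, not the code in the pull 
request: the interface ExemptionFilter, the class OutlinkPolicy and the rule 
that a single accepting filter is enough to exempt a link are all assumptions 
made for illustration.

```java
import java.net.URI;
import java.util.List;

/** Hypothetical stand-in for the new extension point; not the actual Nutch interface. */
interface ExemptionFilter {
  /** Returns true if the outlink toUrl, found on page fromUrl, should be exempted. */
  boolean isExempt(String fromUrl, String toUrl);
}

/** Hypothetical decision logic for outlinks when external links are ignored. */
final class OutlinkPolicy {

  static boolean shouldFollow(String fromUrl, String toUrl,
                              boolean ignoreExternalLinks,
                              List<ExemptionFilter> filters) {
    if (!ignoreExternalLinks) {
      return true;                       // nothing is being ignored
    }
    String fromHost = hostOf(fromUrl);
    String toHost = hostOf(toUrl);
    if (fromHost != null && fromHost.equalsIgnoreCase(toHost)) {
      return true;                       // same host: an internal link
    }
    // External link: follow it only if some filter exempts it (assumed semantics).
    for (ExemptionFilter filter : filters) {
      if (filter.isExempt(fromUrl, toUrl)) {
        return true;
      }
    }
    return false;
  }

  private static String hostOf(String url) {
    try {
      return URI.create(url).getHost();
    } catch (IllegalArgumentException e) {
      return null;                       // unparsable URL: treated as having no host
    }
  }

  public static void main(String[] args) {
    ExemptionFilter images = (from, to) -> to.endsWith(".png");
    System.out.println(shouldFollow("http://a.example/p.html",
        "http://cdn.example/x.png", true, List.of(images)));
  }
}
```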


Resolves https://issues.apache.org/jira/browse/NUTCH-2144

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/nutch NUTCH-2144-ignore-exempt

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/89.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #89


commit 29c7ae4ec088f0f428ab95992e10af9d87a231ad
Author: Thamme Gowda 
Date:   2015-10-19T14:31:04Z

Add an extension point and an extension to
 override 'db.ignore.external.links'.

 + Add extension point
  org.apache.nutch.net.URLExemptionFilter
 + Modify FetcherThread and ParseOutputFormat to
  integrate new extension point
 + Add extension urlfilter-ignoreexempt
 + build configs modified to include new extension

commit 43583f7af19c21dd7553c5e41c026cb852f0cfa1
Author: Thamme Gowda 
Date:   2016-02-11T00:03:03Z

Added ignore exemption extension point and an extension

commit 559eb4d905a081e34095f0e9d5ee3e805363ccc3
Author: Thamme Gowda 
Date:   2016-02-11T00:21:08Z

README updated

commit 3cf887befc56c3cf4127f45385e83c47248dd6e9
Author: Thamme Gowda 
Date:   2016-02-11T00:23:12Z

Add an example rule fiile




> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2015-10-19 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963082#comment-14963082
 ] 

Markus Jelsma commented on NUTCH-2144:
--

Hi - I like the purpose of this plugin. The patch, however, is hardly readable; 
it contains various diffs for the same files. Can you provide a clean patch?

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Priority: Minor
> Attachments: ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2015-10-19 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964110#comment-14964110
 ] 

Markus Jelsma commented on NUTCH-2144:
--

Yes, this is much more readable indeed.

In ExemptionUrlFilter there is a TODO for getting a content type in the Nutch 
way. It looks like you just want to get a content type for a given URL. Since 
you are using a built-in httpclient to do a HEAD request, and want to do it via 
the fetcher, this means you are going to make many additional requests. This is 
bad; we need to find a way to get the content type for any URL via the 
CrawlDatum.

I had some thoughts about this earlier: URL filters are missing context 
completely, which we should fix some day anyway! But since this is about 
external items, it is much harder because there is no information about them in 
the CrawlDB to begin with.

Would any of our other committers like to share some thoughts about these 
issues?

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Priority: Minor
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2015-10-19 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964175#comment-14964175
 ] 

Thamme Gowda N commented on NUTCH-2144:
---

Thanks for your feedback. I agree that the content-type will be missing for 
newly discovered URLs. I am also quite sure that HTTP HEAD requests add a lot 
of overhead.

First I apply a regex filter to limit the URLs, so the HEAD call is made only 
on the URLs matched by the regex. However, that was still not sufficient, so 
the content type filter is made optional.

For example, 
{color:red}
  image/jpeg,image/png,image/gif=.*\.(jpg|JPG|png$|PNG|gif|GIF)$
{color}

The above rule makes an HTTP HEAD call to the URLs matched by the regex to 
determine the content type.

However, 
{color:red}
=.*\.(jpg|JPG|png$|PNG|gif|GIF)$
{color}

this rule doesn't make an HTTP HEAD call, because the content type filter is not applied.
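
The two rule forms above can be read with a small parser along these lines. It 
is a sketch under assumptions, not the plugin's code: the class ExemptionRule 
and its methods are hypothetical names, the line is split at the first '=', and 
the regex is applied to the whole URL with Matcher.matches().

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical parser for rule lines of the form "mime1,mime2=regex". */
final class ExemptionRule {

  final List<String> mimeTypes; // empty list: no content-type filter, hence no HEAD call
  final Pattern urlPattern;

  private ExemptionRule(List<String> mimeTypes, Pattern urlPattern) {
    this.mimeTypes = mimeTypes;
    this.urlPattern = urlPattern;
  }

  /** Splits the line at the first '=': MIME types on the left, URL regex on the right. */
  static ExemptionRule parse(String line) {
    String trimmed = line.trim();
    int eq = trimmed.indexOf('=');
    if (eq < 0) {
      throw new IllegalArgumentException("Missing '=' in rule: " + line);
    }
    List<String> mimes = Arrays.stream(trimmed.substring(0, eq).split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
    return new ExemptionRule(mimes, Pattern.compile(trimmed.substring(eq + 1)));
  }

  /** True if the whole URL matches the rule's regex. */
  boolean matchesUrl(String url) {
    return urlPattern.matcher(url).matches();
  }

  /** A HEAD request is only needed when the rule lists content types. */
  boolean needsHeadRequest() {
    return !mimeTypes.isEmpty();
  }

  public static void main(String[] args) {
    ExemptionRule rule = parse("=.*\\.(jpg|JPG|png$|PNG|gif|GIF)$");
    System.out.println(rule.matchesUrl("http://cdn.example.com/a.png"));
    System.out.println(rule.needsHeadRequest());
  }
}
```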

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Priority: Minor
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.


