[jira] [Commented] (NUTCH-2362) Upgrade MaxMind GeoIP version in index-geoip

2017-12-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293312#comment-16293312
 ] 

Hudson commented on NUTCH-2362:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3481 (See 
[https://builds.apache.org/job/Nutch-trunk/3481/])
NUTCH-2362 Upgrade MaxMind GeoIP version in index-geoip (snagel: 
[https://github.com/apache/nutch/commit/29ee56206ab49bf3dfff0fdbc8c1ebaa47e4efe7])
* (edit) src/plugin/index-geoip/plugin.xml
* (edit) src/plugin/index-geoip/ivy.xml


> Upgrade MaxMind GeoIP version in index-geoip
> 
>
> Key: NUTCH-2362
> URL: https://issues.apache.org/jira/browse/NUTCH-2362
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.14
>
>
> The current version of the GeoIP dependency is 2.8.1; we should upgrade:
> http://search.maven.org/#search|gav|1|g%3A%22com.maxmind.geoip2%22%20AND%20a%3A%22geoip2%22



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2362) Upgrade MaxMind GeoIP version in index-geoip

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293223#comment-16293223
 ] 

ASF GitHub Bot commented on NUTCH-2362:
---

sebastian-nagel closed pull request #262: NUTCH-2362 Upgrade MaxMind GeoIP 
version in index-geoip
URL: https://github.com/apache/nutch/pull/262
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/plugin/index-geoip/ivy.xml b/src/plugin/index-geoip/ivy.xml
index 1b626f073..aa56a68e5 100644
--- a/src/plugin/index-geoip/ivy.xml
+++ b/src/plugin/index-geoip/ivy.xml
@@ -36,10 +36,12 @@
   
 
   
-
+
   
   
   
+  
+  
 
   
   
diff --git a/src/plugin/index-geoip/plugin.xml 
b/src/plugin/index-geoip/plugin.xml
index 214fbd08a..821ecc010 100644
--- a/src/plugin/index-geoip/plugin.xml
+++ b/src/plugin/index-geoip/plugin.xml
@@ -25,15 +25,11 @@
   
  
   
-  
-  
-  
-  
-  
-  
-  
-  
-  
+  
+  
+  
+  
+  

 



 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade MaxMind GeoIP version in index-geoip
> 
>
> Key: NUTCH-2362
> URL: https://issues.apache.org/jira/browse/NUTCH-2362
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.14
>
>
> The current version of the GeoIP dependency is 2.8.1; we should upgrade:
> http://search.maven.org/#search|gav|1|g%3A%22com.maxmind.geoip2%22%20AND%20a%3A%22geoip2%22





[jira] [Resolved] (NUTCH-2362) Upgrade MaxMind GeoIP version in index-geoip

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2362.

Resolution: Fixed

> Upgrade MaxMind GeoIP version in index-geoip
> 
>
> Key: NUTCH-2362
> URL: https://issues.apache.org/jira/browse/NUTCH-2362
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.14
>
>
> The current version of the GeoIP dependency is 2.8.1; we should upgrade:
> http://search.maven.org/#search|gav|1|g%3A%22com.maxmind.geoip2%22%20AND%20a%3A%22geoip2%22





[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293219#comment-16293219
 ] 

Sebastian Nagel commented on NUTCH-2478:


Ok, pull request [#263|https://github.com/apache/nutch/pull/263] opened. The 
unit test is ported to parse-tika.

> // is not a valid base URL
> --
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
> 
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/", 
>         abs.toString());
>   }
> {code}
> and has to fail because the base URL is invalid, so the current URL is used 
> instead. If the path of the current URL is not /, that path is prepended, 
> and the resulting URLs, which return 404, get crawled.
> This ticket must allow // as a base and resolve the protocol.
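
The resolution the ticket asks for can be illustrated with plain java.net.URI. This is only a minimal sketch, not the change in pull request #263 and not Nutch's URLUtil: the class name ProtocolRelativeBase, the method resolve, and the page URL on current.example are hypothetical names chosen for the example. It assumes the scheme-less base is first resolved against the URL of the page being parsed, so that it inherits that page's protocol.

```java
import java.net.URI;

public class ProtocolRelativeBase {

    /**
     * Resolves a possibly scheme-less base (such as "//host/") against the
     * URL of the current page, then resolves the link target against it.
     */
    static String resolve(String pageUrl, String base, String target) {
        URI page = URI.create(pageUrl);
        // A reference starting with "//" carries an authority but no scheme,
        // so it takes the scheme of the URI it is resolved against.
        URI absoluteBase = page.resolve(base);
        return absoluteBase.resolve(target).toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.org/index/produkt/kanaly/
        System.out.println(resolve(
            "http://current.example/some/page.html",
            "//www.example.org/",
            "index/produkt/kanaly/"));
    }
}
```

With this approach the path of the current page (/some/page.html here) never leaks into the resolved link, which is the failure the ticket describes.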





[jira] [Commented] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.4

2017-12-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293116#comment-16293116
 ] 

Hudson commented on NUTCH-2354:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3480 (See 
[https://builds.apache.org/job/Nutch-trunk/3480/])
NUTCH-2354 Upgrade Hadoop dependencies to 2.7.4 (snagel: 
[https://github.com/apache/nutch/commit/b2ba0ab9bcc245317c17d37d2298bc08e7c32be4])
* (edit) ivy/ivy.xml


> Upgrade Hadoop dependencies to 2.7.4
> 
>
> Key: NUTCH-2354
> URL: https://issues.apache.org/jira/browse/NUTCH-2354
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.14
>
>
> This Wednesday we experienced trouble running the 1.12 injector on Hadoop 
> 2.7.3. We operated 2.7.2 before and we had no trouble running a job.
> {code}
> 2017-01-18 15:36:53,005 FATAL [main] org.apache.hadoop.mapred.YarnChild: 
> Error running child : java.lang.IncompatibleClassChangeError: Found interface 
> org.apache.hadoop.mapreduce.Counter, but class was expected
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:216)
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
> interface org.apache.hadoop.mapreduce.Counter, but class was expected
> at org.apache.nutch.crawl.Injector.inject(Injector.java:383)
> at org.apache.nutch.crawl.Injector.run(Injector.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.Injector.main(Injector.java:441)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {code}
> Our processes retried injecting for a few minutes until we manually shut them 
> down. Meanwhile, on HDFS, our CrawlDB was gone. Thanks to snapshots and/or 
> backups we could restore it, so enable those if you haven't done so yet.
> These freak Hadoop errors can be notoriously difficult to debug, but it seems 
> we are in luck: recompile Nutch with Hadoop 2.7.3 instead of 2.4.0. You are also 
> in luck if your job file uses the old org.apache.hadoop.mapred.* API; only jobs 
> using the org.apache.hadoop.mapreduce.* API seem to fail.





[jira] [Commented] (NUTCH-2480) Upgrade crawler-commons dependency to 0.9

2017-12-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293115#comment-16293115
 ] 

Hudson commented on NUTCH-2480:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3480 (See 
[https://builds.apache.org/job/Nutch-trunk/3480/])
NUTCH-2480 Upgrade crawler-commons dependency to 0.9 (snagel: 
[https://github.com/apache/nutch/commit/ee0ff5aba1bbc37ad1567e8f98b03fc566d07d90])
* (edit) ivy/ivy.xml


> Upgrade crawler-commons dependency to 0.9
> -
>
> Key: NUTCH-2480
> URL: https://issues.apache.org/jira/browse/NUTCH-2480
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, deployment
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> Crawler-commons [0.9 is 
> released|https://groups.google.com/d/msg/crawler-commons/O39RrYlwwTY/m4VS0YMvBgAJ].
>  We should upgrade the dependency: there are significant improvements in the 
> sitemap parser, also crawler-commons 0.9 depends on Tika 1.16 which minimizes 
> the gap to Tika 1.17 (NUTCH-2439).





[jira] [Resolved] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.4

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2354.

Resolution: Fixed

Thanks, everyone!

> Upgrade Hadoop dependencies to 2.7.4
> 
>
> Key: NUTCH-2354
> URL: https://issues.apache.org/jira/browse/NUTCH-2354
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.14
>
>
> This Wednesday we experienced trouble running the 1.12 injector on Hadoop 
> 2.7.3. We operated 2.7.2 before and we had no trouble running a job.
> {code}
> 2017-01-18 15:36:53,005 FATAL [main] org.apache.hadoop.mapred.YarnChild: 
> Error running child : java.lang.IncompatibleClassChangeError: Found interface 
> org.apache.hadoop.mapreduce.Counter, but class was expected
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:216)
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
> interface org.apache.hadoop.mapreduce.Counter, but class was expected
> at org.apache.nutch.crawl.Injector.inject(Injector.java:383)
> at org.apache.nutch.crawl.Injector.run(Injector.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.Injector.main(Injector.java:441)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {code}
> Our processes retried injecting for a few minutes until we manually shut them 
> down. Meanwhile, on HDFS, our CrawlDB was gone. Thanks to snapshots and/or 
> backups we could restore it, so enable those if you haven't done so yet.
> These freak Hadoop errors can be notoriously difficult to debug, but it seems 
> we are in luck: recompile Nutch with Hadoop 2.7.3 instead of 2.4.0. You are also 
> in luck if your job file uses the old org.apache.hadoop.mapred.* API; only jobs 
> using the org.apache.hadoop.mapreduce.* API seem to fail.





[jira] [Resolved] (NUTCH-2480) Upgrade crawler-commons dependency to 0.9

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2480.

Resolution: Fixed
  Assignee: Sebastian Nagel

> Upgrade crawler-commons dependency to 0.9
> -
>
> Key: NUTCH-2480
> URL: https://issues.apache.org/jira/browse/NUTCH-2480
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, deployment
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> Crawler-commons [0.9 is 
> released|https://groups.google.com/d/msg/crawler-commons/O39RrYlwwTY/m4VS0YMvBgAJ].
>  We should upgrade the dependency: there are significant improvements in the 
> sitemap parser, also crawler-commons 0.9 depends on Tika 1.16 which minimizes 
> the gap to Tika 1.17 (NUTCH-2439).





[jira] [Commented] (NUTCH-2480) Upgrade crawler-commons dependency to 0.9

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293080#comment-16293080
 ] 

ASF GitHub Bot commented on NUTCH-2480:
---

sebastian-nagel closed pull request #260: NUTCH-2480 Upgrade crawler-commons 
dependency to 0.9
URL: https://github.com/apache/nutch/pull/260
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index a72a63283..2f7708199 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -74,7 +74,9 @@
 

 
-   
+   
+   
+   
 




 




> Upgrade crawler-commons dependency to 0.9
> -
>
> Key: NUTCH-2480
> URL: https://issues.apache.org/jira/browse/NUTCH-2480
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, deployment
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> Crawler-commons [0.9 is 
> released|https://groups.google.com/d/msg/crawler-commons/O39RrYlwwTY/m4VS0YMvBgAJ].
>  We should upgrade the dependency: there are significant improvements in the 
> sitemap parser, also crawler-commons 0.9 depends on Tika 1.16 which minimizes 
> the gap to Tika 1.17 (NUTCH-2439).





[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292924#comment-16292924
 ] 

Hudson commented on NUTCH-2439:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3479 (See 
[https://builds.apache.org/job/Nutch-trunk/3479/])
NUTCH-2439 Upgrade Apache Tika dependency to 1.17 (snagel: 
[https://github.com/apache/nutch/commit/c95f77e01502f1a319a47dc29ff70cfebc5aa63e])
* (edit) ivy/ivy.xml
* (edit) src/plugin/parse-tika/ivy.xml
* (edit) src/plugin/parse-tika/plugin.xml


> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2035) Regex filter using case sensitive rules.

2017-12-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292898#comment-16292898
 ] 

Hudson commented on NUTCH-2035:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1599 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1599/])
NUTCH-2035 urlfilter-regex case insensitive rules (snagel: 
[https://github.com/apache/nutch/commit/ba4b2d495feb9351f7c07c767ecf9f4672cee2e3])
* (edit) conf/regex-urlfilter.txt.template


> Regex filter using case sensitive rules.
> 
>
> Key: NUTCH-2035
> URL: https://issues.apache.org/jira/browse/NUTCH-2035
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: filters, regex, regex-urlfilter
> Fix For: 2.4, 1.14
>
> Attachments: regex-urlfilter.txt
>
>
> Regular expressions are computationally expensive, and listing both cases, as in 
> “EXE|exe|JPG|jpg”, adds up if we use complex rules.
> The regex filter should use case-insensitive rules to make the rules more 
> readable and improve performance.
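
The difference between the two rule styles can be sketched with java.util.regex. This is an illustration only, assuming the filter compiles its rules with java.util.regex and accepts the inline (?i) flag. The class name CaseInsensitiveSuffix, the helper matches, and the example URL are hypothetical, and the patterns are not copied from conf/regex-urlfilter.txt.template. The sketch shows readability and coverage, and makes no measurement of performance.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveSuffix {

    // Case-sensitive rule: every spelling has to be listed explicitly.
    static final Pattern EXPLICIT = Pattern.compile("\\.(EXE|exe|JPG|jpg)$");

    // Same rule with the inline (?i) flag: one spelling covers all cases.
    static final Pattern INSENSITIVE = Pattern.compile("(?i)\\.(exe|jpg)$");

    /** Returns true if the rule matches anywhere in the URL. */
    static boolean matches(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://www.example.org/photo.Jpg";
        System.out.println(matches(EXPLICIT, url));    // prints false: "Jpg" is not listed
        System.out.println(matches(INSENSITIVE, url)); // prints true
    }
}
```

The explicit rule misses mixed-case suffixes unless every variant is spelled out, while the case-insensitive rule stays short and covers them all.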





[jira] [Resolved] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2439.

Resolution: Fixed

Merged into 1.x, thanks!

> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292837#comment-16292837
 ] 

ASF GitHub Bot commented on NUTCH-2439:
---

sebastian-nagel closed pull request #259: NUTCH-2439 Upgrade Apache Tika 
dependency to 1.17
URL: https://github.com/apache/nutch/pull/259
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index e68b0dd84..f386527a2 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1406,6 +1406,12 @@ CAUTION: Set the parser.timeout to -1 or a bigger value 
than 30, when using this
 
 -->
 
+
+ tika.config.file
+ tika-config.xml
+ Nutch-specific Tika config file
+
+
 
   tika.uppercase.element.names
   true
diff --git a/conf/tika-config.xml.template b/conf/tika-config.xml.template
new file mode 100644
index 0..30af37d7b
--- /dev/null
+++ b/conf/tika-config.xml.template
@@ -0,0 +1,20 @@
+
+
+
+
+
diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index a72a63283..f7867c47d 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -47,7 +47,7 @@



-   
+   


 
@@ -65,7 +65,7 @@
 
 
 
-   
+   

 

diff --git a/src/plugin/parse-tika/ivy.xml b/src/plugin/parse-tika/ivy.xml
index a01ec9804..24ad25b4e 100644
--- a/src/plugin/parse-tika/ivy.xml
+++ b/src/plugin/parse-tika/ivy.xml
@@ -36,11 +36,14 @@
   
 
   
-
+
   
   
   
   
+  
+  
+  
 
   
   
diff --git a/src/plugin/parse-tika/plugin.xml b/src/plugin/parse-tika/plugin.xml
index 7f14d9803..b9055e415 100644
--- a/src/plugin/parse-tika/plugin.xml
+++ b/src/plugin/parse-tika/plugin.xml
@@ -25,95 +25,87 @@
   
  
   
-  
-  
+  
+  
   
-  
-  
-  
-  
+  
+  
+  
   
   
   
   
   
-  
+  
+  
   
   
-  
-  
-  
-  
-  
-  
-  
-  
-  
+  
+  
+  
+  
+  
+  
   
-  
+  
   
   
-  
+  
   
-  
+  
   
-  
-  
-  
+  
+  
+  
+  
   
   
   
   
+  
   
-  
   
-  
+  
   
-  
-  
+  
   
   
-  
+  
   
   
   
+  
   
   
-  
-  
-  
-  
-  
+  
   
-  
-  
-  
-  
-  
-  
-  
-  
-  
+  
+  
+  
+  
+  
+  
+  
   
-  
   
   
-  
-  
-  
-  
-  
+  
+  
+  
+  
+  
+  
   
   
-  
+  
   
-  
-  
+  
+  
   
   
-  
-  
-  
+  
+  
+  

 



 




> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2157) Parent Issue for Addressing Miredot REST API Warnings

2017-12-15 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292830#comment-16292830
 ] 

Lewis John McGibbney commented on NUTCH-2157:
-

There are still many warnings: 
http://nutch.apache.org/miredot/1.13/index.html#warnings
Let's set the fix version to 1.15.

> Parent Issue for Addressing Miredot REST API Warnings 
> --
>
> Key: NUTCH-2157
> URL: https://issues.apache.org/jira/browse/NUTCH-2157
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, REST_api
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
> Fix For: 1.15
>
>
> This is a parent issue for addressing the numerous warnings reported by 
> Miredot. An example can be seen here: 
> http://people.apache.org/~lewismc/miredot/#warnings
> For context on this issue, please see NUTCH-1800.
> It is a large issue with a lot of work, so I assume that we can hammer through 
> it gradually as opposed to all at once.





[jira] [Updated] (NUTCH-2157) Parent Issue for Addressing Miredot REST API Warnings

2017-12-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2157:

Fix Version/s: (was: 1.14)
   1.15

> Parent Issue for Addressing Miredot REST API Warnings 
> --
>
> Key: NUTCH-2157
> URL: https://issues.apache.org/jira/browse/NUTCH-2157
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, REST_api
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
> Fix For: 1.15
>
>
> This is a parent issue for addressing the numerous warnings reported by 
> Miredot. An example can be seen here: 
> http://people.apache.org/~lewismc/miredot/#warnings
> For context on this issue, please see NUTCH-1800.
> It is a large issue with a lot of work, so I assume that we can hammer through 
> it gradually as opposed to all at once.





[jira] [Resolved] (NUTCH-2181) Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch

2017-12-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2181.
-
   Resolution: Won't Fix
Fix Version/s: 1.14

These are never kept up-to-date

> Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch
> --
>
> Key: NUTCH-2181
> URL: https://issues.apache.org/jira/browse/NUTCH-2181
> Project: Nutch
>  Issue Type: Task
>  Components: website
>Reporter: Lewis John McGibbney
> Fix For: 1.14
>
>
> It would be nice to have a webpage/wiki page dedicated to 3rd party libraries 
> which can be used with Nutch.
> http://github.com/chrismattmann/nutch-python.git is an example





[jira] [Updated] (NUTCH-2185) protocol-soda-consumer plugin

2017-12-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2185:

Fix Version/s: (was: 1.15)
   1.14

> protocol-soda-consumer plugin
> -
>
> Key: NUTCH-2185
> URL: https://issues.apache.org/jira/browse/NUTCH-2185
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.14
>
>
> I'm finishing off a Nutch protocol implementation for interacting with the 
> popular [Socrata|https://www.socrata.com/] Open Data platform via their 
> [soda-java api|https://github.com/socrata/soda-java]. I feel that this would 
> be useful for Government and other public sector organizations who make their 
> data available through the Socrata platforms so it is my intention to propose 
> it as a protocol-soda-consumer plugin for Nutch.





[jira] [Resolved] (NUTCH-2185) protocol-soda-consumer plugin

2017-12-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2185.
-
Resolution: Won't Fix

This was a very limited use case and is not worth integration into Nutch.

> protocol-soda-consumer plugin
> -
>
> Key: NUTCH-2185
> URL: https://issues.apache.org/jira/browse/NUTCH-2185
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.15
>
>
> I'm finishing off a Nutch protocol implementation for interacting with the 
> popular [Socrata|https://www.socrata.com/] Open Data platform via their 
> [soda-java api|https://github.com/socrata/soda-java]. I feel that this would 
> be useful for Government and other public sector organizations who make their 
> data available through the Socrata platforms so it is my intention to propose 
> it as a protocol-soda-consumer plugin for Nutch.





[jira] [Commented] (NUTCH-2035) Regex filter using case sensitive rules.

2017-12-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292802#comment-16292802
 ] 

Hudson commented on NUTCH-2035:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3478 (See 
[https://builds.apache.org/job/Nutch-trunk/3478/])
NUTCH-2035 urlfilter-regex case insensitive rules (snagel: 
[https://github.com/apache/nutch/commit/df14c8a0a19e4f670d75ecd7ae2a22c3d8eeb0b6])
* (edit) conf/regex-urlfilter.txt.template


> Regex filter using case sensitive rules.
> 
>
> Key: NUTCH-2035
> URL: https://issues.apache.org/jira/browse/NUTCH-2035
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: filters, regex, regex-urlfilter
> Fix For: 2.4, 1.14
>
> Attachments: regex-urlfilter.txt
>
>
> Regular expressions are computationally expensive, and listing both cases, as in 
> “EXE|exe|JPG|jpg”, adds up if we use complex rules.
> The regex filter should use case-insensitive rules to make the rules more 
> readable and improve performance.





[jira] [Updated] (NUTCH-2334) Extension point for schedulers

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2334:
---
Fix Version/s: (was: 1.14)
   1.15

> Extension point for schedulers
> --
>
> Key: NUTCH-2334
> URL: https://issues.apache.org/jira/browse/NUTCH-2334
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.12
>Reporter: Roannel Fernández Hernández
>Priority: Minor
> Fix For: 1.15
>
>
> With an extension point for schedulers, users should be able to create 
> new schedulers that meet their own needs.





[jira] [Updated] (NUTCH-2419) Domain blacklist URL filter does not respect command-line override for file

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2419:
---
Fix Version/s: (was: 1.14)
   1.15

> Domain blacklist URL filter does not respect command-line override for file
> ---
>
> Key: NUTCH-2419
> URL: https://issues.apache.org/jira/browse/NUTCH-2419
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2419.patch
>
>






[jira] [Updated] (NUTCH-2309) Scoring-Similarity Plugin raises NullPointerException when error occurs in fetching URL

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2309:
---
Fix Version/s: (was: 1.14)
   1.15

> Scoring-Similarity Plugin raises NullPointerException when error occurs in 
> fetching URL
> ---
>
> Key: NUTCH-2309
> URL: https://issues.apache.org/jira/browse/NUTCH-2309
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, scoring
>Affects Versions: 1.12
>Reporter: Joey Hong
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.15
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When the Scoring-Similarity plugin is enabled, a NullPointerException is 
> thrown, cancelling the crawl, when computing the cosine similarity for URLs 
> whose fetch failed with any kind of error. 
> The error occurs in line 77 of CosineSimilarity.java:
> float score = Float.parseFloat(parseData.getContentMeta().get(Nutch.SCORE_KEY));
> This is probably because no value is stored under Nutch.SCORE_KEY for such 
> URLs. It can be easily fixed by setting a default value for the score.
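
The suggested default-value fix could look roughly like the following. Float.parseFloat throws a NullPointerException when passed null, which matches the failure described above. This is a minimal sketch and not the actual plugin code: the class ScoreDefault, the helper parseScore, and the choice of fallback value are hypothetical, and the real fix would read the string from the parse metadata instead of taking it as an argument.

```java
public class ScoreDefault {

    /**
     * Parses a score string, returning defaultScore when the value is
     * missing (null) or not a valid float.
     */
    static float parseScore(String raw, float defaultScore) {
        if (raw == null) {
            // No score was stored, e.g. because the fetch failed.
            return defaultScore;
        }
        try {
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            return defaultScore;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseScore(null, 0.0f));  // prints 0.0
        System.out.println(parseScore("1.5", 0.0f)); // prints 1.5
    }
}
```

Which default is appropriate depends on how the scoring filter treats unscored pages; 0.0f is used here only as a placeholder.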





[jira] [Updated] (NUTCH-2030) ParseZip plugin is not able to extract language from zip document,this could solve that problem.

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2030:
---
Fix Version/s: (was: 1.14)
   1.15

> ParseZip plugin is not able to extract language from zip document,this could 
> solve that problem.
> 
>
> Key: NUTCH-2030
> URL: https://issues.apache.org/jira/browse/NUTCH-2030
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
> Environment: Linux Mint 17 qiana, 4 GB Ram,Core I3.
>Reporter: Eyeris Rodriguez Rueda
>Priority: Minor
> Fix For: 1.15
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Currently the parse-zip plugin doesn't extract the language from zip documents, 
> so the lang field is empty in Solr or Elastic. If the package (.zip) contains 
> several documents, the lang field could be multivalued to hold that list of 
> languages. A simple change to the parse-zip plugin could fix this problem. I 
> will use the Language Identifier class from Tika and analyze each document inside.





[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1228:
---
Fix Version/s: (was: 1.14)
   1.15

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 2.4, 1.15
>
> Attachments: NUTCH-1228-2.1.patch
>
>






[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1228:
---
Fix Version/s: 2.4

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 2.4, 1.15
>
> Attachments: NUTCH-1228-2.1.patch
>
>






[jira] [Updated] (NUTCH-2312) Support PhantomJS as a WebDriver in protocol-selenium

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2312:
---
Fix Version/s: (was: 1.14)
   1.15

> Support PhantomJS as a WebDriver in protocol-selenium
> -
>
> Key: NUTCH-2312
> URL: https://issues.apache.org/jira/browse/NUTCH-2312
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.12
>Reporter: Joey Hong
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.15
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> PhantomJS is a great parallelizable and headless browser to work with Nutch 
> via protocol-selenium. It looks like the phantomjs JAR is already in the 
> dependencies, and an empty initialization for the PhantomJSDriver exists in 
> protocol-selenium source code.
> However, in its current state, protocol-selenium will not fetch any URLs with 
> phantomjs, and configurations must be passed in via a DesiredCapabilities 
> object. Also, a parameter must be created to allow users to add a path to 
> their phantomjs binary inside nutch-site.xml.





[jira] [Updated] (NUTCH-2247) Protocol resolver

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2247:
---
Fix Version/s: (was: 1.14)
   1.15

> Protocol resolver
> -
>
> Key: NUTCH-2247
> URL: https://issues.apache.org/jira/browse/NUTCH-2247
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.15
>
> Attachments: NUTCH-2274.patch
>
>
> Protocol resolver program capable of emitting rules for the 
> urlnormalizer-protocol to ingest.





[jira] [Updated] (NUTCH-2133) Transfer Selenium Documentation to WIki

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2133:
---
Fix Version/s: (was: 1.14)
   1.15

> Transfer Selenium Documentation to WIki
> ---
>
> Key: NUTCH-2133
> URL: https://issues.apache.org/jira/browse/NUTCH-2133
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 1.15
>
>
> There's a decent chunk of Selenium-related documentation stuck in READMEs for 
> various plugins. It would be nice to get this stuff pushed to the wiki.
> E.g.: 
> https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-selenium/README.md





[jira] [Updated] (NUTCH-2188) While crawling with solr url (kerberos enabled) Error: org.apache.solr.common.SolrException: Unauthorized

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2188:
---
Fix Version/s: (was: 1.14)
   1.15

> While crawling with solr url (kerberos enabled) Error: 
> org.apache.solr.common.SolrException: Unauthorized
> -
>
> Key: NUTCH-2188
> URL: https://issues.apache.org/jira/browse/NUTCH-2188
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.9
> Environment: Proof of Concept
>Reporter: Mohankumar K H
> Fix For: 1.15
>
>
> 15/12/16 21:49:22 INFO mapreduce.Job: Task Id : 
> attempt_1449548680888_0063_r_02_0, Status : FAILED
> Error: org.apache.solr.common.SolrException: Unauthorized
> Unauthorized
> request: 
> https://hdrdn001c.cps.intel.com:8985/solr/nutch_std_config/update?wt=javabin=2
> at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
> at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
> at 
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
> at 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
> at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
> at 
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
> at 
> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
> at 
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Container killed by the ApplicationMaster.
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143





[jira] [Updated] (NUTCH-2033) parse-tika skips valid documents.

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2033:
---
Fix Version/s: (was: 1.14)
   1.15

> parse-tika skips valid documents.
> -
>
> Key: NUTCH-2033
> URL: https://issues.apache.org/jira/browse/NUTCH-2033
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>  Labels: mime-type, parse-tika, parser, tika
> Fix For: 1.15
>
>
> If we run:
> {code}
> bin/nutch parsechecker -dumpText 
> http://ngdc.noaa.gov/geoportal/openSearchDescription
> {code}
> we’ll get:
> {code}
> Status: failed(2,0): Can't retrieve Tika parser for mime-type 
> application/opensearchdescription+xml
> {code}
> The same occurs for:
> {code}
> bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
> {code}
> Both would be perfectly valid documents if they were returned as 
> "application/xml" and "text/plain" respectively. 
> This happens because parse-tika uses the mime type to retrieve a suitable 
> parser, and some composite mime types are not included in its list even though 
> the documents are perfectly valid and parsable. This does not even take into 
> account that servers often return incorrect mime types for the documents 
> requested.
> We created a helper class as a workaround for this issue. The class uses 
> regular expressions to define synonyms. In the first case any mime type that 
> matches "application/(.*)\+xml" will be replaced by "application/xml". This 
> way parse-tika will parse the document just fine.
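
The synonym idea described above could look roughly like the sketch below. It is not the helper class from the issue: the class name MimeSynonyms and the method normalize are hypothetical, and only the one rule quoted in the description is included.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class MimeSynonyms {

    // Ordered rule table: the first pattern that matches the whole
    // mime type decides the replacement.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

    static {
        // Rule quoted in the issue: any "application/<something>+xml"
        // type is treated as plain "application/xml".
        RULES.put(Pattern.compile("application/(.*)\\+xml"), "application/xml");
    }

    /** Returns the synonym for mimeType, or mimeType itself if no rule matches. */
    public static String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(mimeType).matches()) {
                return rule.getValue();
            }
        }
        return mimeType;
    }

    public static void main(String[] args) {
        System.out.println(normalize("application/opensearchdescription+xml"));
        System.out.println(normalize("text/html"));
    }
}
```

The normalized type would then be used for the parser lookup instead of the type reported by the server; where exactly that hook belongs in parse-tika is not specified in the issue.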





[jira] [Updated] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2369:
---
Fix Version/s: (was: 1.14)
   1.15

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> --
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
>  Issue Type: Task
>  Components: crawldb, graphgenerator, hostdb, linkdb, segment, 
> storage, tool
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: gsoc2017
> Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch 
> data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using TinkerPop's ScriptInputFormat and 
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl 
> Records. 
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, the LinkDB, a 
> Segment and possibly the HostDB in order to be fully populated. Graph 
> characteristics, e.g. Edges, would come from those existing data structures 
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have 
> already talked offline with a potential student [~omkar20895] about him 
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this 
> could be a game changer for how Nutch Crawl data is interpreted. It is my 
> feeling that this issue most likely also involves an entire upgrade of the 
> Hadoop APIs from mapred to mapreduce for the master codebase.





[jira] [Commented] (NUTCH-2157) Parent Issue for Addressing Miredot REST API Warnings

2017-12-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292692#comment-16292692
 ] 

Sebastian Nagel commented on NUTCH-2157:


There is a successful commit. Is this fixed?

> Parent Issue for Addressing Miredot REST API Warnings 
> --
>
> Key: NUTCH-2157
> URL: https://issues.apache.org/jira/browse/NUTCH-2157
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, REST_api
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
> Fix For: 1.14
>
>
> This is a parent issue for addressing the numerous warnings reported by 
> Miredot. An example can be seen here 
> http://people.apache.org/~lewismc/miredot/#warnings
> For context on this issue please see NUTCH-1800
> It is a large issue with a lot of work, so I assume that we can hammer through 
> it gradually as opposed to all at once.





[jira] [Updated] (NUTCH-2156) Dump via Services end point

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2156:
---
Fix Version/s: (was: 1.14)
   1.15

> Dump via Services end point 
> 
>
> Key: NUTCH-2156
> URL: https://issues.apache.org/jira/browse/NUTCH-2156
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Sujen Shah
> Fix For: 1.15
>
>
> Expose the ./bin/nutch dump command via the REST API. 
> Please review the documentation of the API design at 
> http://docs.apachenutchrestapi.apiary.io/# and give your feedback. 
> Thank you all for your input :) 





[jira] [Updated] (NUTCH-2151) Service endpoint for REST API

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2151:
---
Fix Version/s: (was: 1.14)
   1.15

> Service endpoint for REST API
> -
>
> Key: NUTCH-2151
> URL: https://issues.apache.org/jira/browse/NUTCH-2151
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Sujen Shah
>  Labels: memex
> Fix For: 1.15
>
>
> The service endpoint will enable users to call Nutch jobs like dump, 
> commoncrawldump, readseg, etc via the REST api. 





[jira] [Updated] (NUTCH-2147) MetadataScoringFilter for Nutch

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2147:
---
Fix Version/s: (was: 1.14)
   1.15

> MetadataScoringFilter for Nutch
> ---
>
> Key: NUTCH-2147
> URL: https://issues.apache.org/jira/browse/NUTCH-2147
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, scoring
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.15
>
>
> This issue originally started by envisioning an implementation of a 
> LanguagePreferenceScoringFilter so that Nutch could easily be made into a 
> directed crawler based on crawl administrator ranking preferences of 
> languages we wish to crawl. 
> Right now this is not possible.
> We already detect and index language within the language-identifier plugin as 
> well as within parse-tika IIRC; however, currently the presence of a language 
> does not affect scoring of pages.
> The scope of this issue has changed to make it more generally applicable for 
> a wider variety of use cases. This will therefore take advantage of 
> NUTCH-1980 by pulling (amongst other things) Language entries from the 
> CrawlDB Metadata.





[jira] [Updated] (NUTCH-1943) Form authentication should not be global and ignore

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1943:
---
Fix Version/s: (was: 1.14)
   1.15

> Form authentication should not be global and ignore 
> ---
>
> Key: NUTCH-1943
> URL: https://issues.apache.org/jira/browse/NUTCH-1943
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: memex
> Fix For: 1.15
>
>
> Taken from [~wastl-nagel]'s comments on NUTCH-827
> bq. the form authentication is global and ignores . So you have to 
> restrict your crawl to the form authentication pages only. Ideally, also form 
> authentication should be bound to a scope (one host, one URL prefix, etc.) 
> same as HTTP authentication.





[jira] [Updated] (NUTCH-2032) Plugin to index the raw content of a readable document.

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2032:
---
Fix Version/s: (was: 1.14)
   1.15

> Plugin to index the raw content of a readable document. 
> 
>
> Key: NUTCH-2032
> URL: https://issues.apache.org/jira/browse/NUTCH-2032
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, parser
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>  Labels: content, index, index-rawcontent, parser, raw
> Fix For: 1.15
>
>
> This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and 
> https://issues.apache.org/jira/browse/NUTCH-1458
> We created a couple plugins to index the raw content of readable documents. 
> If we include these plugins in the plugin chain we'll index the raw content 
> of a readable document, i.e. XML, HTML, CSV, TXT etc. The index-rawcontent 
> plugin is not designed to index binary files, however having the full content 
> of an HTML/XML or a CSV document is really critical for some of us.





[jira] [Updated] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2363:
---
Fix Version/s: (was: 1.14)
   1.15

> Fetcher support for reading and setting cookies
> ---
>
> Key: NUTCH-2363
> URL: https://issues.apache.org/jira/browse/NUTCH-2363
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.15
>
> Attachments: NUTCH-2363.patch
>
>
> Patch adds basic support for cookies in the fetcher, and a scoring plugin 
> that passes cookies to its outlinks, within the domain. Sub-domain or path 
> based is not supported.
> This is useful if you want to maintain sessions or need to get around a 
> cookie wall.





[jira] [Updated] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2162:
---
Fix Version/s: (was: 1.14)
   1.15

> Nutch Webapp Crawl fails as it tries to index
> -
>
> Key: NUTCH-2162
> URL: https://issues.apache.org/jira/browse/NUTCH-2162
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.15
>
> Attachments: nutch_webapp.log
>
>
> Right now a crawl task fails on the trunk version of the WebApp due to it 
> attempting to index. No indexer is defined by default so this is a major bug.





[jira] [Updated] (NUTCH-2181) Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2181:
---
Fix Version/s: (was: 1.14)

> Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch
> --
>
> Key: NUTCH-2181
> URL: https://issues.apache.org/jira/browse/NUTCH-2181
> Project: Nutch
>  Issue Type: Task
>  Components: website
>Reporter: Lewis John McGibbney
>
> It would be nice to have a webpage/wiki page dedicated to 3rd party libraries 
> which can be used with Nutch.
> http://github.com/chrismattmann/nutch-python.git is an example





[jira] [Updated] (NUTCH-2214) Index clean to be flexible on what it deletes

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2214:
---
Fix Version/s: (was: 1.14)
   1.15

> Index clean to be flexible on what it deletes
> -
>
> Key: NUTCH-2214
> URL: https://issues.apache.org/jira/browse/NUTCH-2214
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.15
>
>
> Nutch clean removes all useless records, but if Nutch is configured correctly 
> (-deleteGone etc), the index should only contain duplicates, if existing. On 
> a large index, this could result in Nutch sending millions of getById's to 
> Solr, for records that don't exist in the first place.
> This issue will make it configurable on what to delete, e.g. useless records 
> (404, 30x) or duplicates.





[jira] [Updated] (NUTCH-2209) Improved Tokenization for Similarity Scoring plugin

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2209:
---
Fix Version/s: (was: 1.14)
   1.15

> Improved Tokenization for Similarity Scoring plugin
> ---
>
> Key: NUTCH-2209
> URL: https://issues.apache.org/jira/browse/NUTCH-2209
> Project: Nutch
>  Issue Type: Improvement
>  Components: scoring
>Reporter: Sujen Shah
>Assignee: Sujen Shah
>  Labels: memex
> Fix For: 1.15
>
>
> This patch would add Lucene based tokenization to the cosine similarity 
> plugin and clean up the code currently present. 





[jira] [Updated] (NUTCH-2185) protocol-soda-consumer plugin

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2185:
---
Fix Version/s: (was: 1.14)
   1.15

> protocol-soda-consumer plugin
> -
>
> Key: NUTCH-2185
> URL: https://issues.apache.org/jira/browse/NUTCH-2185
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.15
>
>
> I'm finishing off a Nutch protocol implementation for interacting with the 
> popular [Socrata|https://www.socrata.com/] Open Data platform via their 
> [soda-java api|https://github.com/socrata/soda-java]. I feel that this would 
> be useful for Government and other public sector organizations who make their 
> data available through the Socrata platforms so it is my intention to propose 
> it as a protocol-soda-consumer plugin for Nutch.





[jira] [Updated] (NUTCH-2265) Write A Test Package for Scoring Similarity

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2265:
---
Fix Version/s: (was: 1.14)
   1.15

> Write A Test Package for Scoring Similarity
> ---
>
> Key: NUTCH-2265
> URL: https://issues.apache.org/jira/browse/NUTCH-2265
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, scoring
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
> Fix For: 1.15
>
>
> There is no test package for org.apache.nutch.scoring.similarity and it 
> should be implemented.





[jira] [Updated] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2292:
---
Fix Version/s: (was: 1.14)
   1.15

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
> Fix For: 1.15
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |-- src: sources
>   |-- nutch-plugins
>       |-- pom.xml (POM)
>       |-- plugin1
>       |   |-- pom.xml (JAR)
>       |   .
>       |-- pluginN
>           |-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins, 
> introduce another POM to break the cycle if required.
>  





[jira] [Commented] (NUTCH-2362) Upgrade MaxMind GeoIP version in index-geoip

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292680#comment-16292680
 ] 

ASF GitHub Bot commented on NUTCH-2362:
---

sebastian-nagel opened a new pull request #262: NUTCH-2362 Upgrade MaxMind 
GeoIP version in index-geoip
URL: https://github.com/apache/nutch/pull/262
 
 
   - upgrade to recent version 2.10.0
   - as there are no unit tests, tested with indexchecker:
   ```
   $ bin/nutch indexchecker -Dstore.ip.address=true 
-Dindex.geoip.usage=cityDatabase \
-Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" 
http://www.example.com/
   fetching: http://www.example.com/
   robots.txt whitelist not configured.
   parsing: http://www.example.com/
   contentType: text/html
   countryIsoCode :US
   postalCode :02061
   title : Example Domain
   accRadius : 50
   content :   Example Domain
   Example Domain
   This domain is established to be used for illustrative examples in doc
   cityName :  Norwell
   digest :09b9c392dc1f6e914cea287cb6be34b0
   host :  www.example.com
   id :http://www.example.com/
   continentCode : NA
   cityGeoNameId : 4945936
   ip :93.184.216.34
   countryGeoName :6252001
   timeZone :  America/New_York
   subDivName :Massachusetts
   url :   http://www.example.com/
   subDivIdoCode : MA
   subDivGeoNameId :   6254926
   tstamp :Fri Dec 15 16:16:19 CET 2017
   latLon :42.1508,-70.8228
   metroCode : 506
   continentGeoNameId :6255149
   countryName :   United States
   continentName : North America
   ```
   (requires to work around 
[NUTCH-2482](https://issues.apache.org/jira/browse/NUTCH-2482))


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade MaxMind GeoIP version in index-geoip
> 
>
> Key: NUTCH-2362
> URL: https://issues.apache.org/jira/browse/NUTCH-2362
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.14
>
>
> Current version of GeoIP dependency is 2.8.1, we should upgrade
> http://search.maven.org/#search|gav|1|g%3A%22com.maxmind.geoip2%22%20AND%20a%3A%22geoip2%22





[jira] [Created] (NUTCH-2482) index-geoip not to add null values to document fields

2017-12-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2482:
--

 Summary: index-geoip not to add null values to document fields
 Key: NUTCH-2482
 URL: https://issues.apache.org/jira/browse/NUTCH-2482
 Project: Nutch
  Issue Type: Bug
  Components: indexer, plugin
Affects Versions: 1.13
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.15


The plugin index-geoip may add null values to document fields, which then causes 
further errors, here an NPE in IndexingFiltersChecker when toString() is called 
on null:
{noformat}
$ bin/nutch indexchecker -Dstore.ip.address=true 
-Dindex.geoip.usage=cityDatabase \
 -Dplugin.includes="protocol-http|parse-html|index-(basic|geoip)" 
http://www.example.com/
...
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.nutch.indexer.IndexingFiltersChecker.fetch(IndexingFiltersChecker.java:340)
at 
org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at 
org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:370)
{noformat}
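
One way to avoid such null entries is to guard every field addition, roughly as sketched below. This is an illustration only: a plain Map stands in for the indexed document, and the class NullSafeFields and the method addIfPresent are hypothetical names, not part of the index-geoip plugin.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NullSafeFields {

    /**
     * Adds the field only when a value is actually available, so that
     * later calls such as value.toString() never see null.
     */
    public static void addIfPresent(Map<String, Object> doc, String field, Object value) {
        if (value != null) {
            doc.put(field, value);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the document being indexed.
        Map<String, Object> doc = new LinkedHashMap<>();

        addIfPresent(doc, "cityName", "Norwell");
        // Simulates a lookup that returned no value for this field.
        addIfPresent(doc, "metroCode", null);

        // Safe: every stored value is non-null.
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue().toString());
        }
    }
}
```

Skipping the field is one option; substituting an explicit placeholder value would be another, at the cost of indexing data that was never returned by the GeoIP lookup.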





[jira] [Updated] (NUTCH-2412) Exchange component for indexing job

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2412:
---
Fix Version/s: (was: 1.14)
   1.15

> Exchange component for indexing job
> ---
>
> Key: NUTCH-2412
> URL: https://issues.apache.org/jira/browse/NUTCH-2412
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, plugin
>Affects Versions: 1.14
>Reporter: Roannel Fernández Hernández
>Priority: Minor
> Fix For: 1.15
>
>
> The exchange component acts in the indexing job and decides which index writer a 
> document should go to. It includes an extension point to allow developers to 
> develop plugins with their own logic.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2249) WordNet Integration for Cosine Similarity

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2249:
---
Fix Version/s: (was: 1.14)
   1.15

> WordNet Integration for Cosine Similarity
> -
>
> Key: NUTCH-2249
> URL: https://issues.apache.org/jira/browse/NUTCH-2249
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, scoring
>Reporter: Bhavya Sanghavi
>Assignee: Sujen Shah
>Priority: Minor
>  Labels: memex
> Fix For: 1.15
>
>
> Integrated WordNet database to enhance the cosine similarity plugin. 
> This helps in reducing the size of the vectors for calculating the cosine 
> similarity by mapping the synonymous words to the same entry in the vector. 
> Consequently, it would increase the accuracy of the scores given to the 
> webpages to be crawled. 
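
The effect of mapping synonyms to one vector entry can be illustrated with the sketch below. It assumes a hand-written synonym table in place of a real WordNet lookup; the class SynonymCosine and both method names are hypothetical and not taken from the plugin.

```java
import java.util.HashMap;
import java.util.Map;

public class SynonymCosine {

    /** Builds a term-frequency vector, folding each token into its canonical synonym. */
    public static Map<String, Integer> termVector(String[] tokens, Map<String, String> synonyms) {
        Map<String, Integer> vector = new HashMap<>();
        for (String token : tokens) {
            String canonical = synonyms.getOrDefault(token, token);
            vector.merge(canonical, 1, Integer::sum);
        }
        return vector;
    }

    /** Cosine similarity of two sparse term-frequency vectors; 0.0 if either is empty. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int count : b.values()) {
            normB += count * count;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical synonym table; a real implementation would query WordNet.
        Map<String, String> synonyms = new HashMap<>();
        synonyms.put("automobile", "car");

        String[] page = {"car", "engine"};
        String[] goal = {"automobile", "engine"};

        Map<String, String> none = new HashMap<>();
        System.out.println("without synonyms: " + cosine(termVector(page, none), termVector(goal, none)));
        System.out.println("with synonyms:    " + cosine(termVector(page, synonyms), termVector(goal, synonyms)));
    }
}
```

With the synonym table, "car" and "automobile" share one dimension, so the two token lists produce identical vectors and the similarity rises from about 0.5 to about 1.0.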





[jira] [Updated] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.4

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2354:
---
Summary: Upgrade Hadoop dependencies to 2.7.4  (was: Upgrade Hadoop 
dependencies to 2.7.3)

> Upgrade Hadoop dependencies to 2.7.4
> 
>
> Key: NUTCH-2354
> URL: https://issues.apache.org/jira/browse/NUTCH-2354
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.14
>
>
> This Wednesday we experienced trouble running the 1.12 injector on Hadoop 
> 2.7.3. We operated 2.7.2 before and had no trouble running a job.
> {code}
> 2017-01-18 15:36:53,005 FATAL [main] org.apache.hadoop.mapred.YarnChild: 
> Error running child : java.lang.IncompatibleClassChangeError: Found interface 
> org.apache.hadoop.mapreduce.Counter, but class was expected
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:216)
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
> interface org.apache.hadoop.mapreduce.Counter, but class was expected
> at org.apache.nutch.crawl.Injector.inject(Injector.java:383)
> at org.apache.nutch.crawl.Injector.run(Injector.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.Injector.main(Injector.java:441)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {code}
> Our processes retried injecting for a few minutes until we manually shut them 
> down. Meanwhile on HDFS, our CrawlDB was gone; thanks to snapshots and/or 
> backups we could restore it, so enable those if you haven't done so yet.
> These freak Hadoop errors can be notoriously difficult to debug, but it seems 
> we are in luck: recompile Nutch with Hadoop 2.7.3 instead of 2.4.0. You are also 
> in luck if your job file uses the old org.hadoop.mapred.* API; only jobs 
> using the org.hadoop.mapreduce.* API seem to fail.





[jira] [Updated] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.4

2017-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2354:
---
Patch Info: Patch Available

> Upgrade Hadoop dependencies to 2.7.4
> 
>
> Key: NUTCH-2354
> URL: https://issues.apache.org/jira/browse/NUTCH-2354
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.14
>
>
> This Wednesday we experienced trouble running the 1.12 injector on Hadoop 
> 2.7.3. We operated 2.7.2 before and we had no trouble running a job.
> {code}
> 2017-01-18 15:36:53,005 FATAL [main] org.apache.hadoop.mapred.YarnChild: 
> Error running child : java.lang.IncompatibleClassChangeError: Found interface 
> org.apache.hadoop.mapreduce.Counter, but class was expected
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:216)
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
> interface org.apache.hadoop.mapreduce.Counter, but class was expected
> at org.apache.nutch.crawl.Injector.inject(Injector.java:383)
> at org.apache.nutch.crawl.Injector.run(Injector.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.Injector.main(Injector.java:441)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {code}
> Our processes retried injecting for a few minutes until we manually shut them 
> down. Meanwhile on HDFS, our CrawlDB was gone; thanks to snapshots and/or 
> backups we could restore it, so enable those if you haven't done so yet.
> These freak Hadoop errors can be notoriously difficult to debug, but it seems 
> we are in luck: recompile Nutch with Hadoop 2.7.3 instead of 2.4.0. You are also 
> in luck if your job file uses the old org.hadoop.mapred.* API; only jobs 
> using the org.hadoop.mapreduce.* API seem to fail.





[jira] [Commented] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.3

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292630#comment-16292630
 ] 

ASF GitHub Bot commented on NUTCH-2354:
---

sebastian-nagel opened a new pull request #261: NUTCH-2354 Upgrade Hadoop 
dependencies to 2.7.4
URL: https://github.com/apache/nutch/pull/261
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade Hadoop dependencies to 2.7.3
> 
>
> Key: NUTCH-2354
> URL: https://issues.apache.org/jira/browse/NUTCH-2354
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.14
>
>
> This Wednesday we experienced trouble running the 1.12 injector on Hadoop 
> 2.7.3. We operated 2.7.2 before and we had no trouble running a job.
> {code}
> 2017-01-18 15:36:53,005 FATAL [main] org.apache.hadoop.mapred.YarnChild: 
> Error running child : java.lang.IncompatibleClassChangeError: Found interface 
> org.apache.hadoop.mapreduce.Counter, but class was expected
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:216)
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
> interface org.apache.hadoop.mapreduce.Counter, but class was expected
> at org.apache.nutch.crawl.Injector.inject(Injector.java:383)
> at org.apache.nutch.crawl.Injector.run(Injector.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.Injector.main(Injector.java:441)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {code}
> Our processes retried injecting for a few minutes until we manually shut them 
> down. Meanwhile on HDFS, our CrawlDB was gone; thanks to snapshots and/or 
> backups we could restore it, so enable those if you haven't done so yet.
> These freak Hadoop errors can be notoriously difficult to debug, but it seems 
> we are in luck: recompile Nutch with Hadoop 2.7.3 instead of 2.4.0. You are also 
> in luck if your job file uses the old org.hadoop.mapred.* API; only jobs 
> using the org.hadoop.mapreduce.* API seem to fail.





[jira] [Commented] (NUTCH-2480) Upgrade crawler-commons dependency to 0.9

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292556#comment-16292556
 ] 

ASF GitHub Bot commented on NUTCH-2480:
---

sebastian-nagel opened a new pull request #260: NUTCH-2480 Upgrade 
crawler-commons dependency to 0.9
URL: https://github.com/apache/nutch/pull/260
 
 
   and exclude the transitive dependency on tika-core, so that the Tika version 
requested as a direct dependency is not evicted.




> Upgrade crawler-commons dependency to 0.9
> -
>
> Key: NUTCH-2480
> URL: https://issues.apache.org/jira/browse/NUTCH-2480
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, deployment
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> Crawler-commons [0.9 is 
> released|https://groups.google.com/d/msg/crawler-commons/O39RrYlwwTY/m4VS0YMvBgAJ].
>  We should upgrade the dependency: there are significant improvements in the 
> sitemap parser, also crawler-commons 0.9 depends on Tika 1.16 which minimizes 
> the gap to Tika 1.17 (NUTCH-2439).





Re: [DISCUSS] Release 1.14?

2017-12-15 Thread Sebastian Nagel
Ok, the pull request for the upgrade to Tika 1.17 is ready:
  https://issues.apache.org/jira/browse/NUTCH-2439
  https://github.com/apache/nutch/pull/259

Thanks,
Sebastian


On 12/14/2017 10:44 AM, Julien Nioche wrote:
> FYI Tika 1.17 has just been released 
> http://www.apache.org/dist/tika/CHANGES-1.17.txt
> 
> On 12 December 2017 at 12:36, Sebastian Nagel wrote:
> 
> Hi Julien,
> 
> yes, I know there's an open issue by Markus which depends on Tika 1.17.
> If the Tika release happens this week, I'll make sure that it's included.
> 
> Thanks,
> Sebastian
> 
> 
> On 12/11/2017 10:22 AM, Julien Nioche wrote:
> > Tika 1.17 will be released shortly, maybe it would be worth waiting a
> > bit and integrating it first?
> >
> > On 8 December 2017 at 22:53, Sebastian Nagel wrote:
> >
> >     Hi all,
> >
> >     50+ issues fixed
> >       https://issues.apache.org/jira/projects/NUTCH/versions/12340218
> >
> >     Of course, as always, there are still many open issues. But maybe it's
> >     time to push a release now and try to integrate the next features and
> >     fixes early next year. What do you think?
> >
> >     The last release (1.13) dates 8 months back (April 2017).
> >
> >     I would be ready to push a release candidate next week.
> >
> >
> >     Sebastian
> >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> > http://www.digitalpebble.com
> > http://digitalpebble.blogspot.com/
> > #digitalpebble
> 
> 
> 
> 
> --
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble



[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292496#comment-16292496
 ] 

ASF GitHub Bot commented on NUTCH-2439:
---

lewismc commented on issue #259: NUTCH-2439 Upgrade Apache Tika dependency to 
1.17
URL: https://github.com/apache/nutch/pull/259#issuecomment-352002621
 
 
   Tika really has become a behemoth of sorts !!!
   +1




> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292486#comment-16292486
 ] 

Sebastian Nagel commented on NUTCH-2439:


Ok, got it: of course, I have to add a tika-config.xml as described in 
TIKA-2490. I'll update the PR :)

> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)

2017-12-15 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Description: 
To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) 
in hostdb. The motivation is to use these statistics in generate with maxCount 
expressions.

See an example below and two possible solutions.

??Let's say for each website we have the condition: generate while number of 
fetched < 150. 
The problem is that for some websites this loop will (almost) never finish, 
because of their structure. 
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page 
...etc.

I would like to add a delta condition for fetched that describes the speed of the 
process. Let's say generate while number of fetched < 150 && delta_fetched > 1. 
Then in this case the process should stop on round 5 with the total number of 
fetched equal to 92. 
??

I see two possible solutions :
1. In HostDatum class apart from current statistic include last step statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb. *The main problem - space. In generate 
HostDatum is stored in a Dictionary(RAM)*

2. 
Include metadata flag(s) in HostDatum and store as a field in 
HostDatum.(Metadata.StopGenerate = true/false). Calculate the value of 
StopGenerate in UpdateHostDB.
*The main advantage is space, we store only flag in the db. The main problem - 
lack of flexibility in Generate*  

  was:
To allow the usage of previous step statistics(deltas of fetched,unfetced etc) 
in hostdb. The motivation is usage of this statistics in generate with maxCount 
expressions.

See an example bellow and two possible solutions.

??Lets say for each website we have condition of generate while number of 
fetched < 150. 
The problem is for some websites that condition will (almost)never be finished, 
because of its structure. 
1) Round1. 1 page
2) Round2. 10 pages
3) Round3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page 
...etc.

I would like to add the delta condition for fetched that describes speed of the 
process. Lets say generate while number of fetched < 150 && delta_fetched > 1. 
Therefore in this case the process should stop on round 5 with total number of 
fetched equals to 92. 
??

I see two possible solutions :
1. In HostDatum class apart from current statistic include last step statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb. *The main problem - space. In generate 
HostDatum is stored in a Dictionary in a memory*

2. 
Include metadata flag(s) in HostDatum and store as a field in 
HostDatum.(Metadata.StopGenerate = true/false). Calculate the value of 
StopGenerate in UpdateHostDB.
*The main advantage is space, we store only flag in the db. The main problem - 
lack of flexibility in Generate*  


> HostDatum deltas(previous step statistics)
> --
>
> Key: NUTCH-2481
> URL: https://issues.apache.org/jira/browse/NUTCH-2481
> Project: Nutch
>  Issue Type: Improvement
>  Components: hostdb
>Reporter: Semyon Semyonov
>
> To allow the usage of previous step statistics (deltas of fetched, unfetched, 
> etc.) in hostdb. The motivation is to use these statistics in generate with 
> maxCount expressions.
> See an example below and two possible solutions.
> ??Let's say for each website we have the condition: generate while number of 
> fetched < 150. 
> The problem is that for some websites this loop will (almost) never 
> finish, because of their structure. 
> 1) Round 1. 1 page
> 2) Round 2. 10 pages
> 3) Round 3. 80 pages
> 4) Round 4. 1 page
> 5) Round 5. 1 page 
> ...etc.
> I would like to add a delta condition for fetched that describes the speed of 
> the process. Let's say generate while number of fetched < 150 && 
> delta_fetched > 1. 
> Then in this case the process should stop on round 5 with the total number 
> of fetched equal to 92. 
> ??
> I see two possible solutions:
> 1. In the HostDatum class, apart from the current statistics, include the 
> last step statistics.
> class PagesStatistics
> {
>   protected int unfetched = 0;
>   protected int fetched = 0;
>   protected int notModified = 0;
>   protected int redirTemp = 0;
>   protected int redirPerm = 0;
>   protected int gone = 

[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)

2017-12-15 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Description: 
To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) 
in hostdb. The motivation is to use these statistics in generate with maxCount 
expressions.

See an example below and two possible solutions.

??Let's say for each website we have the condition: generate while number of 
fetched < 150. 
The problem is that for some websites this loop will (almost) never finish, 
because of their structure. 
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page 
...etc.

I would like to add a delta condition for fetched that describes the speed of the 
process. Let's say generate while number of fetched < 150 && delta_fetched > 1. 
Then in this case the process should stop on round 5 with the total number of 
fetched equal to 92. 
??

I see two possible solutions :
1. In HostDatum class apart from current statistic include last step statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb. *The main problem - space. In generate 
HostDatum is stored in a Dictionary in a memory*

2. 
Include metadata flag(s) in HostDatum and store as a field in 
HostDatum.(Metadata.StopGenerate = true/false). Calculate the value of 
StopGenerate in UpdateHostDB.
*The main advantage is space, we store only flag in the db. The main problem - 
lack of flexibility in Generate*  

  was:
To allow the usage of previous step statistics(deltas of fetched,unfetced etc) 
in hostdb. The motivation is usage of this statistics in generate with maxCount 
expressions.

See an example bellow and two possible solutions.

??Lets say for each website we have condition of generate while number of 
fetched < 150. 
The problem is for some websites that condition will (almost)never be finished, 
because of its structure. 
1) Round1. 1 page
2) Round2. 10 pages
3) Round3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page 
...etc.

I would like to add the delta condition for fetched that describes speed of the 
process. Lets say generate while number of fetched < 150 && delta_fetched > 1. 
Therefore in this case the process should stop on round 5 with total number of 
fetched equals to 92. 
??

I see two possible solutions :
1. In HostDatum class apart from current statistic include last step statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb.* The main problem - space. In generate 
HostDatum is stored in a Dictionary in a memory,*

2. 
Include metadata flag(s) in HostDatum and store as a field in 
HostDatum.(Metadata.StopGenerate = true/false). Calculate the value of 
StopGenerate in UpdateHostDB.
*The main advantage is space, we store only flag in the db. The main problem - 
lack of flexibility in Generate.,*  


> HostDatum deltas(previous step statistics)
> --
>
> Key: NUTCH-2481
> URL: https://issues.apache.org/jira/browse/NUTCH-2481
> Project: Nutch
>  Issue Type: Improvement
>  Components: hostdb
>Reporter: Semyon Semyonov
>
> To allow the usage of previous step statistics (deltas of fetched, unfetched, 
> etc.) in hostdb. The motivation is to use these statistics in generate with 
> maxCount expressions.
> See an example below and two possible solutions.
> ??Let's say for each website we have the condition: generate while number of 
> fetched < 150. 
> The problem is that for some websites this loop will (almost) never 
> finish, because of their structure. 
> 1) Round 1. 1 page
> 2) Round 2. 10 pages
> 3) Round 3. 80 pages
> 4) Round 4. 1 page
> 5) Round 5. 1 page 
> ...etc.
> I would like to add a delta condition for fetched that describes the speed of 
> the process. Let's say generate while number of fetched < 150 && 
> delta_fetched > 1. 
> Then in this case the process should stop on round 5 with the total number 
> of fetched equal to 92. 
> ??
> I see two possible solutions:
> 1. In the HostDatum class, apart from the current statistics, include the 
> last step statistics.
> class PagesStatistics
> {
>   protected int unfetched = 0;
>   protected int fetched = 0;
>   protected int notModified = 0;
>   protected int redirTemp = 0;
>   protected int redirPerm = 0;
>   protected 

[jira] [Created] (NUTCH-2481) HostDatum deltas(previous step statistics)

2017-12-15 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2481:
--

 Summary: HostDatum deltas(previous step statistics)
 Key: NUTCH-2481
 URL: https://issues.apache.org/jira/browse/NUTCH-2481
 Project: Nutch
  Issue Type: Improvement
  Components: hostdb
Reporter: Semyon Semyonov


To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) 
in hostdb. The motivation is to use these statistics in generate with maxCount 
expressions.

See an example below and two possible solutions.

??Let's say for each website we have the condition: generate while number of 
fetched < 150. 
The problem is that for some websites this loop will (almost) never finish, 
because of their structure. 
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page 
...etc.

I would like to add a delta condition for fetched that describes the speed of the 
process. Let's say generate while number of fetched < 150 && delta_fetched > 1. 
Then in this case the process should stop on round 5 with the total number of 
fetched equal to 92. 
??

I see two possible solutions:
1. In the HostDatum class, apart from the current statistics, include the last step statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb. *The main problem - space. In generate 
HostDatum is stored in a Dictionary in memory.*

2. 
Include metadata flag(s) in HostDatum and store as a field in 
HostDatum (Metadata.StopGenerate = true/false). Calculate the value of 
StopGenerate in UpdateHostDB.
*The main advantage is space, we store only a flag in the db. The main problem - 
lack of flexibility in Generate.*  
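
The stopping rule from the example can be sketched in plain Java. This is only an illustration, not Nutch code: the class DeltaStopSketch and the method simulate are hypothetical names, and the sketch assumes that the delta check is skipped in the first round, where no previous-step statistics exist yet (otherwise the example would already stop after round 1).

```java
final class DeltaStopSketch {

    /**
     * Replays per-round fetch counts and applies the condition
     * "fetched < maxFetched && delta_fetched > minDelta" after each round.
     * Returns {roundsRun, totalFetched}.
     */
    static int[] simulate(int[] fetchedPerRound, int maxFetched, int minDelta) {
        int fetched = 0;
        int rounds = 0;
        for (int delta : fetchedPerRound) {
            fetched += delta;
            rounds++;
            // Assumption: round 1 has no previous-step statistics, so only
            // the total is checked there.
            boolean deltaOk = rounds == 1 || delta > minDelta;
            if (!(fetched < maxFetched && deltaOk)) {
                break; // do not generate another round for this host
            }
        }
        return new int[] {rounds, fetched};
    }

    public static void main(String[] args) {
        int[] result = simulate(new int[] {1, 10, 80, 1, 1}, 150, 1);
        System.out.println("rounds run: " + result[0] + ", fetched: " + result[1]);
    }
}
```

With the rounds from the example (1, 10, 80, 1, 1 pages) the sketch stops after round 4, so round 5 is never generated and the total stays at 92.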





[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292476#comment-16292476
 ] 

Sebastian Nagel commented on NUTCH-2439:


Of course, I get the warning about Tesseract only because it's installed on my 
laptop. Nothing else necessary to suppress warnings?

> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292474#comment-16292474
 ] 

ASF GitHub Bot commented on NUTCH-2439:
---

sebastian-nagel opened a new pull request #259: NUTCH-2439 Upgrade Apache Tika 
dependency to 1.17
URL: https://github.com/apache/nutch/pull/259
 
 
   




> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292469#comment-16292469
 ] 

Markus Jelsma commented on NUTCH-2439:
--

Weird, I only got:

Dec 15, 2017 1:45:42 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.

But this is our custom parser, which is actually just an older TikaParser using 
a custom ContentHandler. This was a problem with 1.17-SNAPSHOT which I 
addressed, and which Tim Allison quickly solved.

> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292450#comment-16292450
 ] 

Sebastian Nagel commented on NUTCH-2439:


Really? I'm almost done with a PR for the upgrade (had to resolve a dependency 
conflict which breaks multiple parse-tika tests), but the number of errors 
written to stderr is still hardly acceptable:
{noformat}
$ bin/nutch parsechecker -Dplugin.includes="protocol-http|parse-tika" 
http://localhost/nutch/test.pdf >/dev/null
Dec 15, 2017 1:37:59 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Dec 15, 2017 1:37:59 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image 
files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on 
via TikaConfig.
Dec 15, 2017 1:37:59 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
{noformat}



> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292421#comment-16292421
 ] 

Markus Jelsma commented on NUTCH-2439:
--

Note, since 1.17, all but one of the warnings are gone.

> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>






[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-15 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292419#comment-16292419
 ] 

Markus Jelsma commented on NUTCH-2478:
--

I prefer your patch, it also carries a test.

> // is not a valid base URL
> --
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
> 
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/",
>         abs.toString());
>   }
> {code}
> It has to fail because the base URL is invalid, so the current URL is used 
> instead. If the current URL's path is not /, that path is prepended, 
> resulting in 404s being crawled.
> This ticket must allow // as a base and resolve the protocol.
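
For illustration, a minimal sketch of resolving a protocol-relative base against the URL of the page it was found on, so that the scheme is inherited. This is an assumption about one possible approach, not the patch attached to this ticket: it uses java.net.URI instead of URLUtil.resolveURL, and the class name, method name and the example page URL are hypothetical.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch.
final class ProtocolRelativeBaseSketch {

  // Resolve a base href such as "//www.example.org/" against the URL of the
  // current page, so that the scheme (http/https) is taken from that page.
  static URI resolveBase(URI currentPage, String baseHref) {
    return currentPage.resolve(baseHref);
  }

  public static void main(String[] args) {
    // Assumed page URL, chosen only for the example.
    URI current = URI.create("http://www.example.org/some/page.html");
    URI base = resolveBase(current, "//www.example.org/");
    // The relative target is then resolved against the repaired base.
    URI abs = base.resolve("index/produkt/kanaly/");
    System.out.println(base);
    System.out.println(abs);
  }
}
```

With the base repaired first, the relative target is resolved against the site root rather than against the path of the current page.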





[jira] [Created] (NUTCH-2480) Upgrade crawler-commons dependency to 0.9

2017-12-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2480:
--

 Summary: Upgrade crawler-commons dependency to 0.9
 Key: NUTCH-2480
 URL: https://issues.apache.org/jira/browse/NUTCH-2480
 Project: Nutch
  Issue Type: Improvement
  Components: build, deployment
Affects Versions: 1.13
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.14


Crawler-commons [0.9 is 
released|https://groups.google.com/d/msg/crawler-commons/O39RrYlwwTY/m4VS0YMvBgAJ].
 We should upgrade the dependency: there are significant improvements in the 
sitemap parser, and crawler-commons 0.9 depends on Tika 1.16, which narrows 
the gap to Tika 1.17 (NUTCH-2439).





[jira] [Comment Edited] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208309#comment-16208309
 ] 

Sebastian Nagel edited comment on NUTCH-2439 at 12/15/17 10:45 AM:
---

+1   Tika-core 1.15 already slipped in as a dependency of crawler-commons 0.8.

The Tika warnings to stderr are annoying. Looks like they cannot be suppressed 
via Nutch's log4j.properties. Or is there a way?
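
One possible direction, offered only as a sketch: the timestamp/WARNING layout of the messages quoted in this thread looks like the default java.util.logging output, which log4j.properties does not control. Assuming the relevant loggers are named under org.apache.tika (an assumption, not verified against Tika's source), raising the level of that parent logger would drop WARNING records. The class and method names below are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Nutch or Tika.
final class TikaWarningSilencer {

  // java.util.logging holds loggers weakly; keep a strong reference so the
  // configured level is not lost if the logger gets garbage collected.
  private static Logger pinned;

  // Raise the level of the named logger; child loggers without an explicit
  // level inherit it.
  static void silence(String loggerName) {
    pinned = Logger.getLogger(loggerName);
    pinned.setLevel(Level.SEVERE);
  }

  public static void main(String[] args) {
    silence("org.apache.tika"); // assumed logger namespace
    Logger child = Logger.getLogger("org.apache.tika.parser.Example");
    child.warning("this record is dropped");
    child.severe("this record is still printed");
  }
}
```

Whether this takes effect in a Nutch job depends on where and when the call is made, which is not covered here.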


was (Author: wastl-nagel):
+1   Tika-core 1.16 already slept into as dependency of crawler-commons 0.8.

The Tika warnings to stderr are annoying. Looks like they cannot be supressed 
via Nutch's log4j.properties. Or is there a way?







[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292231#comment-16292231
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

sebastian-nagel commented on a change in pull request #219: NUTCH-2415 : Create 
a JEXL based IndexingFilter
URL: https://github.com/apache/nutch/pull/219#discussion_r157154504
 
 

 ##
 File path: build.xml
 ##
 @@ -1042,6 +1042,8 @@
 
 
 
+
 
 Review comment:
   - the index-jexl-filter dir should also be added to the Eclipse target: 
there will be 4 entries in the main build.xml
   - the plugin folders in build.xml are sorted lexicographically: 
index-jexl-filter should come before index-links
   - please also add the package to plugins.index in default.properties. That's 
required to list the plugin on the [Javadoc overview 
page](https://builds.apache.org/job/nutch-trunk/javadoc/)
   - also a package-info.java would be great


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for a 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.


