[jira] [Commented] (NUTCH-2434) Add methods to reset parameters HTMLMetaTags

2020-04-30 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096645#comment-17096645
 ] 

Markus Jelsma commented on NUTCH-2434:
--

Ah, thanks!

> Add methods to reset parameters HTMLMetaTags
> 
>
> Key: NUTCH-2434
> URL: https://issues.apache.org/jira/browse/NUTCH-2434
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2434.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2434) Add methods to reset parameters HTMLMetaTags

2020-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096633#comment-17096633
 ] 

Sebastian Nagel commented on NUTCH-2434:


+1

[~markus17], nothing to complain about, as this does not change the behavior of any 
parser, it just adds methods. I'll include it in 1.17.

> Add methods to reset parameters HTMLMetaTags
> 
>
> Key: NUTCH-2434
> URL: https://issues.apache.org/jira/browse/NUTCH-2434
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2434.patch
>
>






[jira] [Updated] (NUTCH-2434) Add methods to reset parameters HTMLMetaTags

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2434:
---
Component/s: parser

> Add methods to reset parameters HTMLMetaTags
> 
>
> Key: NUTCH-2434
> URL: https://issues.apache.org/jira/browse/NUTCH-2434
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2434.patch
>
>






[jira] [Updated] (NUTCH-2434) Add methods to reset parameters HTMLMetaTags

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2434:
---
Summary: Add methods to reset parameters HTMLMetaTags  (was: Option to 
reset parameters HTMLMetaTags)

> Add methods to reset parameters HTMLMetaTags
> 
>
> Key: NUTCH-2434
> URL: https://issues.apache.org/jira/browse/NUTCH-2434
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2434.patch
>
>






[GitHub] [nutch] sebastian-nagel opened a new pull request #523: NUTCH-2753 Add -listen option to command-line help of CrawlDbReader and LinkDbReader

2020-04-30 Thread GitBox


sebastian-nagel opened a new pull request #523:
URL: https://github.com/apache/nutch/pull/523


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-2753) Add -listen option to command-line help of CrawlDbReader and LinkDbReader

2020-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096630#comment-17096630
 ] 

ASF GitHub Bot commented on NUTCH-2753:
---

sebastian-nagel opened a new pull request #523:
URL: https://github.com/apache/nutch/pull/523


   





> Add -listen option to command-line help of CrawlDbReader and LinkDbReader
> -
>
> Key: NUTCH-2753
> URL: https://issues.apache.org/jira/browse/NUTCH-2753
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb, linkdb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
>  Labels: easytask, help-wanted
> Fix For: 1.17
>
>
> The tools CrawlDbReader and LinkDbReader extend AbstractChecker but do not 
> show `-listen  [-keepClientCnxOpen]` as available option(s).





[jira] [Assigned] (NUTCH-2758) Add plugin READMEs to binary release packages

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2758:
--

Assignee: Sebastian Nagel

> Add plugin READMEs to binary release packages
> -
>
> Key: NUTCH-2758
> URL: https://issues.apache.org/jira/browse/NUTCH-2758
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, plugin
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> Almost 20 plugins have a README (.md or .txt) which explains how to use and 
> configure the plugin. The READMEs should be included in the binary release 
> packages.





[GitHub] [nutch] sebastian-nagel opened a new pull request #522: NUTCH-2758 Add plugin READMEs to binary release packages

2020-04-30 Thread GitBox


sebastian-nagel opened a new pull request #522:
URL: https://github.com/apache/nutch/pull/522


   







[jira] [Commented] (NUTCH-2758) Add plugin READMEs to binary release packages

2020-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096472#comment-17096472
 ] 

ASF GitHub Bot commented on NUTCH-2758:
---

sebastian-nagel opened a new pull request #522:
URL: https://github.com/apache/nutch/pull/522


   





> Add plugin READMEs to binary release packages
> -
>
> Key: NUTCH-2758
> URL: https://issues.apache.org/jira/browse/NUTCH-2758
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, plugin
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> Almost 20 plugins have a README (.md or .txt) which explains how to use and 
> configure the plugin. The READMEs should be included in the binary release 
> packages.





[jira] [Resolved] (NUTCH-2425) Update GettingNutchRunningWithUbuntu wiki article

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2425.

Fix Version/s: (was: 1.17)
   Resolution: Abandoned

The wiki page 
https://cwiki.apache.org/confluence/display/NUTCH/GettingNutchRunningWithUbuntu 
is now in "Archive and Legacy". It's strongly recommended to follow the 
tutorial (https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial) 
which is regularly updated.

> Update GettingNutchRunningWithUbuntu wiki article
> -
>
> Key: NUTCH-2425
> URL: https://issues.apache.org/jira/browse/NUTCH-2425
> Project: Nutch
>  Issue Type: Task
>  Components: documentation, wiki
>Reporter: Karl Richter
>Priority: Major
>
> https://wiki.apache.org/nutch/GettingNutchRunningWithUbuntu contains some 
> errors (e.g. `echo 'http://lucene.apache.org/nutch/' > urls` where `urls` is 
> a directory) and obsolete parts (`conf/crawl-urlfilter.txt` is 
> `conf/regex-urlfilter.txt` in 2.x) and thus does not appear to be well tested.





[jira] [Commented] (NUTCH-2423) Update contributor info page

2020-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096452#comment-17096452
 ] 

Sebastian Nagel commented on NUTCH-2423:


Applies to:
- https://cwiki.apache.org/confluence/display/NUTCH/Becoming+A+Nutch+Developer
- https://cwiki.apache.org/confluence/display/NUTCH/HowToContribute

Both pages should be adapted to git-based workflows (maybe not only using GitHub). 
The [README|https://github.com/apache/nutch/blob/master/README.md] already 
lists the most important commands.

 

> Update contributor info page
> 
>
> Key: NUTCH-2423
> URL: https://issues.apache.org/jira/browse/NUTCH-2423
> Project: Nutch
>  Issue Type: Task
>  Components: documentation, wiki
>Reporter: Karl Richter
>Priority: Major
>  Labels: easytask, help-wanted
> Fix For: 1.18
>
>
> The [contributor info 
> page](https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer) still 
> mentions subversion as SCM which I assume is obsolete because there's 
> git://git.apache.org/nutch.git. It should mention how the devs with write 
> access deal with pull/merge requests in general or on different popular 
> platforms (the information that they're not accepted is valuable as well).





[jira] [Updated] (NUTCH-2423) Update contributor info page

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2423:
---
Fix Version/s: (was: 1.17)
   1.18

> Update contributor info page
> 
>
> Key: NUTCH-2423
> URL: https://issues.apache.org/jira/browse/NUTCH-2423
> Project: Nutch
>  Issue Type: Task
>  Components: documentation, wiki
>Reporter: Karl Richter
>Priority: Major
>  Labels: easytask, help-wanted
> Fix For: 1.18
>
>
> The [contributor info 
> page](https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer) still 
> mentions subversion as SCM which I assume is obsolete because there's 
> git://git.apache.org/nutch.git. It should mention how the devs with write 
> access deal with pull/merge requests in general or on different popular 
> platforms (the information that they're not accepted is valuable as well).





[jira] [Resolved] (NUTCH-2507) NutchTutorial wiki pages as a lot of outdated command line calls when it starts with the solr interaction

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2507.

  Assignee: Sebastian Nagel
Resolution: Fixed

Thanks, [~artodeto]! The section in 
[https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial] related to 
Solr indexing has been updated.

> NutchTutorial wiki pages as a lot of outdated command line calls when it 
> starts with the solr interaction
> -
>
> Key: NUTCH-2507
> URL: https://issues.apache.org/jira/browse/NUTCH-2507
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.14
>Reporter: artodeto
>Assignee: Sebastian Nagel
>Priority: Major
>  Labels: documentation, easyfix
> Fix For: 1.17
>
>
> h2. Section "Step-by-Step: Indexing into Apache Solr"
> replace:
> {code:java}
> Example: bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb 
> crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize 
> -deleteGone{code}
> with:
> {code:java}
> Example: bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch 
> ${NUTCH_RUNTIME_HOME}/crawl/crawldb/ -linkdb ${NUTCH_RUNTIME_HOME}/crawl/linkdb/ 
> ${NUTCH_RUNTIME_HOME}/crawl/segments/20131108063838/ -filter -normalize -deleteGo{code}
>  
> h2. Section "Step-by-Step: Deleting Duplicates"
> replace:
> {code:java}
>  Usage: bin/nutch dedup 
>  Example: /bin/nutch dedup http://localhost:8983/solr
> {code}
> with:
> {code:java}
>  Usage: bin/nutch dedup  
>  Example: /bin/nutch dedup ${NUTCH_RUNTIME_HOME}/crawl/crawldb/ 
> http://localhost:8983/sol
> {code}
> h2. Section "Step-by-Step: Cleaning Solr"
> replace:
> {code:java}
>  Usage: bin/nutch clean -Dsolr.server.url= 
>  Example: /bin/nutch clean 
> -Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb/
> {code}
> with:
> {code}
>  Usage: bin/nutch clean -Dsolr.server.url= 
>  Example: /bin/nutch clean 
> -Dsolr.server.url=http://localhost:8983/solr/nutch 
> ${NUTCH_RUNTIME_HOME}/crawl/crawldb/
> {code}





[jira] [Updated] (NUTCH-2002) ParserChecker and IndexingFiltersChecker to check robots.txt

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2002:
---
Summary: ParserChecker and IndexingFiltersChecker to check robots.txt  
(was: ParserChecker to check robots.txt)

> ParserChecker and IndexingFiltersChecker to check robots.txt
> 
>
> Key: NUTCH-2002
> URL: https://issues.apache.org/jira/browse/NUTCH-2002
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.9
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.17
>
> Attachments: NUTCH-2002.patch
>
>
> ParserChecker could check whether a given URL is allowed by the robots.txt 
> directives.  





[jira] [Assigned] (NUTCH-2002) ParserChecker and IndexingFiltersChecker to check robots.txt

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2002:
--

Assignee: Sebastian Nagel

> ParserChecker and IndexingFiltersChecker to check robots.txt
> 
>
> Key: NUTCH-2002
> URL: https://issues.apache.org/jira/browse/NUTCH-2002
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.9
>Reporter: Julien Nioche
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.17
>
> Attachments: NUTCH-2002.patch
>
>
> ParserChecker could check whether a given URL is allowed by the robots.txt 
> directives.  





[jira] [Commented] (NUTCH-2002) ParserChecker to check robots.txt

2020-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096424#comment-17096424
 ] 

ASF GitHub Bot commented on NUTCH-2002:
---

sebastian-nagel opened a new pull request #521:
URL: https://github.com/apache/nutch/pull/521


   - applied Julien's patch to recent code base
   - also check redirects whether they are allowed
   - add command-line parameter `-checkRobotsTxt` enabling this check
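
The check described in the list above can be illustrated with a minimal sketch. It is not the code from the pull request and it does not use the robots.txt parser Nutch relies on. All class and method names are hypothetical, and the matching is deliberately simplified: only `Disallow` prefixes of the `User-agent: *` group are considered, while `Allow` rules and wildcards are ignored. The last method shows the idea of applying the same check to every URL of a redirect chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt check; not the Nutch implementation. */
final class RobotsCheckSketch {

  /** Collects the Disallow path prefixes that apply to "User-agent: *". */
  static List<String> disallowedPrefixes(String robotsTxt) {
    List<String> prefixes = new ArrayList<>();
    boolean inWildcardGroup = false;
    boolean lastWasAgent = false;
    for (String raw : robotsTxt.split("\\r?\\n")) {
      String line = raw.replaceFirst("#.*$", "").trim(); // strip comments
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue; // blank or malformed line
      }
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (field.equals("user-agent")) {
        if (!lastWasAgent) {
          inWildcardGroup = false; // a new group starts here
        }
        if (value.equals("*")) {
          inWildcardGroup = true;
        }
        lastWasAgent = true;
      } else {
        lastWasAgent = false;
        if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
          prefixes.add(value);
        }
      }
    }
    return prefixes;
  }

  /** A path is allowed unless it starts with one of the disallowed prefixes. */
  static boolean isAllowed(String path, List<String> disallowed) {
    for (String prefix : disallowed) {
      if (path.startsWith(prefix)) {
        return false;
      }
    }
    return true;
  }

  /** Checks the original path and every redirect target in the chain. */
  static boolean chainAllowed(List<String> paths, List<String> disallowed) {
    for (String path : paths) {
      if (!isAllowed(path, disallowed)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> rules = disallowedPrefixes("User-agent: *\nDisallow: /private/\n");
    System.out.println(isAllowed("/public/index.html", rules));
    System.out.println(chainAllowed(Arrays.asList("/old", "/private/new"), rules));
  }
}
```

In this sketch a redirect chain is rejected as soon as one of its targets is disallowed, which matches the idea of "also check redirects" but says nothing about how the pull request actually reports such cases.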





> ParserChecker to check robots.txt
> -
>
> Key: NUTCH-2002
> URL: https://issues.apache.org/jira/browse/NUTCH-2002
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.9
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.17
>
> Attachments: NUTCH-2002.patch
>
>
> ParserChecker could check whether a given URL is allowed by the robots.txt 
> directives.  





[GitHub] [nutch] sebastian-nagel opened a new pull request #521: NUTCH-2002 parse and index checkers to check robots.txt

2020-04-30 Thread GitBox


sebastian-nagel opened a new pull request #521:
URL: https://github.com/apache/nutch/pull/521


   - applied Julien's patch to recent code base
   - also check redirects whether they are allowed
   - add command-line parameter `-checkRobotsTxt` enabling this check







[jira] [Commented] (NUTCH-2743) Add list of Nutch properties (nutch-default.xml) to documentation

2020-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096398#comment-17096398
 ] 

Sebastian Nagel commented on NUTCH-2743:


Also note that properties can be addressed via page anchors: 
https://builds.apache.org/job/nutch-trunk/javadoc/resources/nutch-default.xml#http.content.limit

> Add list of Nutch properties (nutch-default.xml) to documentation
> -
>
> Key: NUTCH-2743
> URL: https://issues.apache.org/jira/browse/NUTCH-2743
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> The file nutch-default.xml lists all Nutch properties. It should become part 
> of the documentation, similar to what is done for Hadoop (e.g. 
> [mapred-default.xml|https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]),
>  including the XSL (configuration.xsl) required to render the file into a 
> table.





[jira] [Commented] (NUTCH-2743) Add list of Nutch properties (nutch-default.xml) to documentation

2020-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096396#comment-17096396
 ] 

Sebastian Nagel commented on NUTCH-2743:


Current properties are now available through nightly builds: 
https://builds.apache.org/job/nutch-trunk/javadoc/resources/nutch-default.xml
I will add links from the [Nutch API docs 
page](https://nutch.apache.org/javadoc.html) when updating it with the 1.17 
release.

> Add list of Nutch properties (nutch-default.xml) to documentation
> -
>
> Key: NUTCH-2743
> URL: https://issues.apache.org/jira/browse/NUTCH-2743
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> The file nutch-default.xml lists all Nutch properties. It should become part 
> of the documentation, similar to what is done for Hadoop (e.g. 
> [mapred-default.xml|https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]),
>  including the XSL (configuration.xsl) required to render the file into a 
> table.





[jira] [Commented] (NUTCH-2784) Add tool to list Nutch and Hadoop properties

2020-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096387#comment-17096387
 ] 

Hudson commented on NUTCH-2784:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3678 (See 
[https://builds.apache.org/job/Nutch-trunk/3678/])
NUTCH-2784 Tool to list Nutch properties and configured values (snagel: 
[https://github.com/apache/nutch/commit/a20c2613c944a8e845632fcc81384abac5dcdf85])
* (edit) src/bin/nutch
* (add) src/java/org/apache/nutch/tools/ShowProperties.java


> Add tool to list Nutch and Hadoop properties
> 
>
> Key: NUTCH-2784
> URL: https://issues.apache.org/jira/browse/NUTCH-2784
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.17
>
>
> Nutch properties are defined in nutch-default.xml but can be redefined 
> (overridden) in nutch-site.xml or from command-line (-Dproperty=value). In 
> addition, property definitions can include other properties 
> ({{${property.name}}}) which makes it sometimes hard to figure out what the 
> actual value of a property is.
> In short, a command-line tool which lists all properties and the configured 
> values could be useful.
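
To illustrate why the effective value of a property can be hard to see, here is a minimal sketch of layered overrides plus `${property.name}` expansion. It is not the ShowProperties tool added for this issue and it does not use the Hadoop configuration classes. The class, the method names and the `example.*` property names are hypothetical, and the depth limit is an assumption made only to stop cyclic definitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of property overriding and ${name} expansion. */
final class PropertyResolutionSketch {

  private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");
  private static final int MAX_DEPTH = 20; // guard against cyclic definitions

  /** Later layers override earlier ones, e.g. defaults, then site, then command line. */
  @SafeVarargs
  static Map<String, String> merge(Map<String, String>... layers) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> layer : layers) {
      merged.putAll(layer);
    }
    return merged;
  }

  /** Expands ${name} references repeatedly; unknown names are left untouched. */
  static String resolve(String key, Map<String, String> props) {
    String value = props.get(key);
    for (int depth = 0; value != null && depth < MAX_DEPTH; depth++) {
      Matcher m = VAR.matcher(value);
      StringBuffer sb = new StringBuffer();
      boolean changed = false;
      while (m.find()) {
        String ref = props.get(m.group(1));
        if (ref != null) {
          changed = true;
        }
        m.appendReplacement(sb, Matcher.quoteReplacement(ref != null ? ref : m.group()));
      }
      m.appendTail(sb);
      value = sb.toString();
      if (!changed) {
        break; // nothing left to expand
      }
    }
    return value;
  }

  public static void main(String[] args) {
    Map<String, String> defaults = new LinkedHashMap<>();
    defaults.put("example.base.dir", "/tmp/nutch");
    defaults.put("example.cache.dir", "${example.base.dir}/cache");
    Map<String, String> site = new LinkedHashMap<>();
    site.put("example.base.dir", "/data/nutch");

    Map<String, String> merged = merge(defaults, site);
    for (String key : merged.keySet()) {
      System.out.println(key + " = " + resolve(key, merged));
    }
  }
}
```

Here `example.cache.dir` is never redefined, yet its effective value changes because the site layer overrides the property it refers to. That is the kind of indirection a listing tool makes visible.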





[jira] [Commented] (NUTCH-2495) Use -deleteGone instead of clean job in crawler script while indexing

2020-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096388#comment-17096388
 ] 

Hudson commented on NUTCH-2495:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3678 (See 
[https://builds.apache.org/job/Nutch-trunk/3678/])
NUTCH-2495: Use -deleteGone instead of clean job in crawl script while (snagel: 
[https://github.com/apache/nutch/commit/7ebd35dc96b8d40846103a8c343edecec1763595])
* (edit) src/bin/crawl


> Use -deleteGone instead of clean job in crawler script while indexing
> -
>
> Key: NUTCH-2495
> URL: https://issues.apache.org/jira/browse/NUTCH-2495
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin
>Affects Versions: 1.15
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17
>
>
> Instead of running {{bin/nutch clean}} after indexing the documents, run 
> {{bin/nutch index}} with the {{-deleteGone}} flag, which deletes not only 
> gone and duplicated documents but also redirects from the index.





[jira] [Commented] (NUTCH-2743) Add list of Nutch properties (nutch-default.xml) to documentation

2020-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096389#comment-17096389
 ] 

Hudson commented on NUTCH-2743:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3678 (See 
[https://builds.apache.org/job/Nutch-trunk/3678/])
NUTCH-2743 Add list of Nutch properties (nutch-default.xml) to (snagel: 
[https://github.com/apache/nutch/commit/462ca6e39db4a3bba8723a14d23445b0471ad7a0])
* (edit) src/plugin/creativecommons/conf/nutch-site.xml
* (edit) build.xml
* (delete) conf/nutch-conf.xsl
* (edit) conf/configuration.xsl
* (edit) conf/nutch-default.xml


> Add list of Nutch properties (nutch-default.xml) to documentation
> -
>
> Key: NUTCH-2743
> URL: https://issues.apache.org/jira/browse/NUTCH-2743
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> The file nutch-default.xml lists all Nutch properties. It should become part 
> of the documentation, similar to what is done for Hadoop (e.g. 
> [mapred-default.xml|https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]),
>  including the XSL (configuration.xsl) required to render the file into a 
> table.





[jira] [Resolved] (NUTCH-2743) Add list of Nutch properties (nutch-default.xml) to documentation

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2743.

Resolution: Implemented

> Add list of Nutch properties (nutch-default.xml) to documentation
> -
>
> Key: NUTCH-2743
> URL: https://issues.apache.org/jira/browse/NUTCH-2743
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> The file nutch-default.xml lists all Nutch properties. It should become part 
> of the documentation, similar to what is done for Hadoop (e.g. 
> [mapred-default.xml|https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]),
>  including the XSL (configuration.xsl) required to render the file into a 
> table.





[jira] [Resolved] (NUTCH-2784) Add tool to list Nutch and Hadoop properties

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2784.

Resolution: Implemented

> Add tool to list Nutch and Hadoop properties
> 
>
> Key: NUTCH-2784
> URL: https://issues.apache.org/jira/browse/NUTCH-2784
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.17
>
>
> Nutch properties are defined in nutch-default.xml but can be redefined 
> (overridden) in nutch-site.xml or from command-line (-Dproperty=value). In 
> addition, property definitions can include other properties 
> ({{${property.name}}}) which makes it sometimes hard to figure out what the 
> actual value of a property is.
> In short, a command-line tool which lists all properties and the configured 
> values could be useful.





[jira] [Resolved] (NUTCH-2495) Use -deleteGone instead of clean job in crawler script while indexing

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2495.

Resolution: Fixed

> Use -deleteGone instead of clean job in crawler script while indexing
> -
>
> Key: NUTCH-2495
> URL: https://issues.apache.org/jira/browse/NUTCH-2495
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin
>Affects Versions: 1.15
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17
>
>
> Instead of running {{bin/nutch clean}} after indexing the documents, run 
> {{bin/nutch index}} with the {{-deleteGone}} flag, which deletes not only 
> gone and duplicated documents but also redirects from the index.





[jira] [Commented] (NUTCH-2776) Fetcher to temporarily deduplicate followed redirects

2020-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096338#comment-17096338
 ] 

Hudson commented on NUTCH-2776:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3677 (See 
[https://builds.apache.org/job/Nutch-trunk/3677/])
NUTCH-2776 Fetcher to temporarily deduplicate followed redirects - cache 
(snagel: 
[https://github.com/apache/nutch/commit/0f33d183c80e3f75f39d8ebe0dff163436b6d710])
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueues.java


> Fetcher to temporarily deduplicate followed redirects
> -
>
> Key: NUTCH-2776
> URL: https://issues.apache.org/jira/browse/NUTCH-2776
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> If the fetcher follows redirects (http.redirect.max > 0), it may happen that many 
> redirects of a site point to the same URL. In this situation, it might be 
> good if the fetcher could temporarily (for a configurable time period) 
> deduplicate the redirect targets and skip all redirects except the first one. 
> Typical examples of duplicated redirect targets are:
> - instead of responding with HTTP status 404:
> {noformat}
> /
> /resource-not-found
> /search/
> /404
> /error/not-found
> /err/notfound.html{noformat}
> - a page to accept/decline cookies
> {noformat}
> /cookie_usage.php
> {noformat}
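
The idea of temporarily deduplicating redirect targets can be sketched as a small bounded cache of recently seen targets. This is only an illustration under stated assumptions, not the code merged for this issue: the class name, the constructor parameters and the eviction policy (drop the oldest entry once the cache is full) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of time-limited redirect target deduplication. */
final class RedirectDedupSketch {

  private final long ttlMillis;
  private final Map<String, Long> seen;

  RedirectDedupSketch(final int maxEntries, long ttlMillis) {
    this.ttlMillis = ttlMillis;
    // Insertion-ordered map that drops its oldest entry once it is full.
    this.seen = new LinkedHashMap<String, Long>(16, 0.75f, false) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        return size() > maxEntries;
      }
    };
  }

  /**
   * Returns true if the redirect target should be followed, false if the
   * same target was already followed within the configured time period.
   */
  synchronized boolean shouldFollow(String targetUrl, long nowMillis) {
    Long firstSeen = seen.get(targetUrl);
    if (firstSeen != null && nowMillis - firstSeen < ttlMillis) {
      return false; // recent duplicate, skip it
    }
    seen.remove(targetUrl); // re-insert so a refreshed entry becomes the newest
    seen.put(targetUrl, nowMillis);
    return true;
  }

  public static void main(String[] args) {
    RedirectDedupSketch dedup = new RedirectDedupSketch(1000, 60_000L);
    String target = "https://example.org/404";
    System.out.println(dedup.shouldFollow(target, 0L));      // first redirect to this target
    System.out.println(dedup.shouldFollow(target, 10_000L)); // duplicate within the period
    System.out.println(dedup.shouldFollow(target, 70_000L)); // period has elapsed
  }
}
```

Passing the current time as a parameter keeps the sketch easy to test. A fetcher would more likely read the clock itself and share one cache between its threads, which is why the method is synchronized here.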





[jira] [Commented] (NUTCH-2772) Debugging parse filter to show serialized DOM tree

2020-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096337#comment-17096337
 ] 

Hudson commented on NUTCH-2772:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3677 (See 
[https://builds.apache.org/job/Nutch-trunk/3677/])
NUTCH-2772 Debugging parse filter to show serialized DOM tree (snagel: 
[https://github.com/apache/nutch/commit/caea3a051aceb947d17ccfaa080f6bd864802a4d])
* (add) src/plugin/parsefilter-debug/plugin.xml
* (add) src/plugin/parsefilter-debug/build.xml
* (edit) default.properties
* (edit) src/plugin/build.xml
* (add) 
src/plugin/parsefilter-debug/src/java/org/apache/nutch/parsefilter/debug/package-info.java
* (edit) src/java/org/apache/nutch/util/DomUtil.java
* (add) src/plugin/parsefilter-debug/ivy.xml
* (edit) build.xml
* (add) 
src/plugin/parsefilter-debug/src/java/org/apache/nutch/parsefilter/debug/DebugParseFilter.java


> Debugging parse filter to show serialized DOM tree
> --
>
> Key: NUTCH-2772
> URL: https://issues.apache.org/jira/browse/NUTCH-2772
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> A tool to show the DOM tree (eg. serialized as XML/HTML) might be helpful for 
> debugging, eg., see NUTCH-2769. The DOM tree is available in the parse 
> plugins and is also passed to the HtmlParseFilter plugins. We could provide a 
> parsefilter-debug plugin which logs the DOM tree and adds the serialized 
> string representation to the parse data.
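
Serializing a DOM tree to a string can be sketched with the JDK's own XML APIs. This is not the DebugParseFilter plugin or the DomUtil change from the commit listed above. The class and method names are hypothetical, and the XML output method is forced here as an assumption so that the result does not depend on the root element's name.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical sketch: serialize a DOM node to a string for debugging. */
final class DomSerializationSketch {

  static String serialize(Node node) throws Exception {
    // An identity transformer copies the DOM tree to the output unchanged.
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    StringWriter out = new StringWriter();
    transformer.transform(new DOMSource(node), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    // Build a tiny document by hand; a parse filter would receive the tree instead.
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element html = doc.createElement("html");
    Element body = doc.createElement("body");
    body.setTextContent("Hello");
    html.appendChild(body);
    doc.appendChild(html);
    System.out.println(serialize(doc));
  }
}
```

A debugging filter could both log such a string and store it alongside the parse data, as the description suggests.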





[jira] [Resolved] (NUTCH-2776) Fetcher to temporarily deduplicate followed redirects

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2776.

Resolution: Implemented

Merged. This feature has been successfully tested in production in a 
large-scale crawl using a cache size of 6000.

> Fetcher to temporarily deduplicate followed redirects
> -
>
> Key: NUTCH-2776
> URL: https://issues.apache.org/jira/browse/NUTCH-2776
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> If the fetcher follows redirects (http.redirect.max > 0), it may happen that many 
> redirects of a site point to the same URL. In this situation, it might be 
> good if the fetcher could temporarily (for a configurable time period) 
> deduplicate the redirect targets and skip all redirects except the first one. 
> Typical examples of duplicated redirect targets are:
> - instead of responding with HTTP status 404:
> {noformat}
> /
> /resource-not-found
> /search/
> /404
> /error/not-found
> /err/notfound.html{noformat}
> - a page to accept/decline cookies
> {noformat}
> /cookie_usage.php
> {noformat}





[jira] [Resolved] (NUTCH-2772) Debugging parse filter to show serialized DOM tree

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2772.

Resolution: Implemented

> Debugging parse filter to show serialized DOM tree
> --
>
> Key: NUTCH-2772
> URL: https://issues.apache.org/jira/browse/NUTCH-2772
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> A tool to show the DOM tree (e.g. serialized as XML/HTML) might be helpful for
> debugging, e.g. see NUTCH-2769. The DOM tree is available in the parse
> plugins and is also passed to the HtmlParseFilter plugins. We could provide a
> parsefilter-debug plugin which logs the DOM tree and adds the serialized
> string representation to the parse data.
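
Serializing a DOM tree to a string, as proposed above, could look roughly like the following sketch. It uses only the JDK's javax.xml.transform API and is not the code of the parsefilter-debug plugin; the class DomSerializer and the helper sampleFragment are hypothetical and exist only to keep the example self-contained.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper: turns a DOM node into an XML string for logging. */
final class DomSerializer {

  /** Serializes the given node (document, fragment or element) as XML. */
  static String serialize(Node node) {
    try {
      // An identity transformer copies the DOM source to the output unchanged.
      Transformer t = TransformerFactory.newInstance().newTransformer();
      t.setOutputProperty(OutputKeys.METHOD, "xml");
      t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      t.transform(new DOMSource(node), new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException("Failed to serialize DOM node", e);
    }
  }

  /** Builds a tiny fragment standing in for the DOM tree a parser would produce. */
  static DocumentFragment sampleFragment() {
    try {
      Document doc =
          DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      DocumentFragment frag = doc.createDocumentFragment();
      Element html = doc.createElement("html");
      Element body = doc.createElement("body");
      body.setTextContent("hello");
      html.appendChild(body);
      frag.appendChild(html);
      return frag;
    } catch (Exception e) {
      throw new IllegalStateException("Failed to build sample DOM", e);
    }
  }
}
```

A debugging filter could log the result of DomSerializer.serialize(...) for the tree it receives and store the same string in the parse metadata; how that is wired into the plugin is left out here.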





[jira] [Updated] (NUTCH-2771) Tests in nightly builds: speed up long runners

2020-04-30 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2771:
---
Fix Version/s: (was: 1.17)
   1.18

> Tests in nightly builds: speed up long runners
> --
>
> Key: NUTCH-2771
> URL: https://issues.apache.org/jira/browse/NUTCH-2771
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.18
>
>
> The Nutch tests (run by "ant test" or "ant nightly") take rather long to run.
> Although all tests are implemented as JUnit tests, some tests are more like
> integration tests, e.g. launching a Jetty web server and fetching documents
> from it. It's nice to also have higher-level tests, and they are expected to
> run longer than a simple unit test. However, some of the test classes take
> really long to run (times taken from
> https://builds.apache.org/job/Nutch-trunk/3663/consoleText):
> {noformat}
> [junit] Running org.apache.nutch.segment.TestSegmentMergerCrawlDatums
> [junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 133.898 sec
> [junit] Running org.apache.nutch.segment.TestSegmentMerger
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 101.026 sec
> [junit] Running org.apache.nutch.crawl.TestGenerator
> [junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 46.03 sec
> [junit] Running org.apache.nutch.fetcher.TestFetcher
> [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.805 sec
> [junit] Running org.apache.nutch.urlfilter.fast.TestFastURLFilter
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.36 sec
> [junit] Running org.apache.nutch.parse.tika.TestPdfParser
> [junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.974 sec
> [junit] Running org.apache.nutch.parse.tika.TestImageMetadata
> [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.113 sec
> [junit] Running org.apache.nutch.parse.feed.TestFeedParser
> [junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.369 sec
> [junit] Running org.apache.nutch.crawl.TestInjector
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.15 sec
> {noformat}
> We could try to speed up at least some of these long-running tests.





[jira] [Commented] (NUTCH-2771) Tests in nightly builds: speed up long runners

2020-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096297#comment-17096297
 ] 

Sebastian Nagel commented on NUTCH-2771:


Moving to 1.18 for now. After a closer look: all these tests are useful. One 
option could be to mark long-runners using [JUnit 5 
tags|https://junit.org/junit5/docs/current/user-guide/#writing-tests-annotations],
which would allow running them separately.

> Tests in nightly builds: speed up long runners
> --
>
> Key: NUTCH-2771
> URL: https://issues.apache.org/jira/browse/NUTCH-2771
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.17
>
>
> The Nutch tests (run by "ant test" or "ant nightly") take rather long to run.
> Although all tests are implemented as JUnit tests, some tests are more like
> integration tests, e.g. launching a Jetty web server and fetching documents
> from it. It's nice to also have higher-level tests, and they are expected to
> run longer than a simple unit test. However, some of the test classes take
> really long to run (times taken from
> https://builds.apache.org/job/Nutch-trunk/3663/consoleText):
> {noformat}
> [junit] Running org.apache.nutch.segment.TestSegmentMergerCrawlDatums
> [junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 133.898 sec
> [junit] Running org.apache.nutch.segment.TestSegmentMerger
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 101.026 sec
> [junit] Running org.apache.nutch.crawl.TestGenerator
> [junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 46.03 sec
> [junit] Running org.apache.nutch.fetcher.TestFetcher
> [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.805 sec
> [junit] Running org.apache.nutch.urlfilter.fast.TestFastURLFilter
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.36 sec
> [junit] Running org.apache.nutch.parse.tika.TestPdfParser
> [junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.974 sec
> [junit] Running org.apache.nutch.parse.tika.TestImageMetadata
> [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.113 sec
> [junit] Running org.apache.nutch.parse.feed.TestFeedParser
> [junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.369 sec
> [junit] Running org.apache.nutch.crawl.TestInjector
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.15 sec
> {noformat}
> We could try to speed up at least some of these long-running tests.


