[jira] [Commented] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778206#comment-17778206
 ] 

Hudson commented on NUTCH-3013:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #134 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/134/])
NUTCH-3013 Employ commons-lang3's StopWatch to simplify timing logic (#788) 
(github: 
[https://github.com/apache/nutch/commit/8431dcfe52f5395a0fd9e3c00db009dbb2bcf6f5])
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) 
src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
* (edit) src/java/org/apache/nutch/parse/ParseSegment.java
* (edit) src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
* (edit) .gitignore
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
* (edit) 
src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
* (edit) src/java/org/apache/nutch/util/CrawlCompletionStats.java
* (edit) src/java/org/apache/nutch/tools/FreeGenerator.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) .github/workflows/master-build.yml
* (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
* (edit) src/java/org/apache/nutch/fetcher/Fetcher.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
* (edit) src/java/org/apache/nutch/indexer/CleaningJob.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDb.java
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbReader.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbMerger.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* (edit) src/java/org/apache/nutch/tools/warc/WARCExporter.java
* (edit) src/java/org/apache/nutch/crawl/LinkDb.java


> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
>  Issue Type: Improvement
>  Components: logging, runtime, util
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn 
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about 
> runtime timings. After some investigation I came across [commons-lang3's 
> StopWatch 
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
>  which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  
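
For illustration, a minimal sketch of the before/after pattern described above, using the commons-lang3 StopWatch API (createStarted, stop, getTime, split); this is a generic example, not the exact change merged in PR #788.

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.commons.lang3.time.StopWatch;

public class StopWatchSketch {
  public static void main(String[] args) throws InterruptedException {
    // Before: the manual pattern found in ~32 files.
    long start = System.currentTimeMillis();
    Thread.sleep(250); // stand-in for a job phase
    long end = System.currentTimeMillis();
    System.out.println("elapsed (manual): " + (end - start) + " ms");

    // After: commons-lang3 StopWatch.
    StopWatch stopWatch = StopWatch.createStarted();
    Thread.sleep(250); // stand-in for a job phase
    stopWatch.stop();
    System.out.println("elapsed (StopWatch): "
        + stopWatch.getTime(TimeUnit.MILLISECONDS) + " ms");

    // Splits allow intermediate timings without extra bookkeeping variables.
    stopWatch.reset();
    stopWatch.start();
    Thread.sleep(100);
    stopWatch.split();
    System.out.println("split: " + stopWatch.toSplitString());
    stopWatch.stop();
  }
}
{code}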



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3013.
-
Resolution: Fixed

Thanks for the review [~snagel] 

> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
>  Issue Type: Improvement
>  Components: logging, runtime, util
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn 
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about 
> runtime timings. After some investigation I came across [commons-lang3's 
> StopWatch 
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
>  which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3013.
---

> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
>  Issue Type: Improvement
>  Components: logging, runtime, util
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn 
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about 
> runtime timings. After some investigation I came across [commons-lang3's 
> StopWatch 
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
>  which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778181#comment-17778181
 ] 

ASF GitHub Bot commented on NUTCH-3013:
---

lewismc merged PR #788:
URL: https://github.com/apache/nutch/pull/788




> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
>  Issue Type: Improvement
>  Components: logging, runtime, util
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn 
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about 
> runtime timings. After some investigation I came across [commons-lang3's 
> StopWatch 
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
>  which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] NUTCH-3013 Employ commons-lang3's StopWatch to simplify timing logic [nutch]

2023-10-21 Thread via GitHub


lewismc merged PR #788:
URL: https://github.com/apache/nutch/pull/788


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778135#comment-17778135
 ] 

Hudson commented on NUTCH-3012:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #133 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/133/])
NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed 
documents (snagel: 
[https://github.com/apache/nutch/commit/d2c3e96d88818d8107f320c49e007329b020e090])
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java


> SegmentReader when dumping with option -recode: NPE on unparsed documents
> -
>
> Key: NUTCH-3012
> URL: https://issues.apache.org/jira/browse/NUTCH-3012
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> SegmentReader when called with the flag {{-recode}} fails with an NPE when 
> trying to stringify the raw content of unparsed documents:
> {noformat}
> $> bin/nutch readseg  -dump crawl/segments/20231009065431 
> crawl/segreader/20231009065431 -recode
> ...
> 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : 
> attempt_1696825862783_0005_r_00_0, Status : FAILED
> Error: java.lang.NullPointerException: charset
> at java.base/java.lang.String.<init>(String.java:504)
> at java.base/java.lang.String.<init>(String.java:561)
> at org.apache.nutch.protocol.Content.toString(Content.java:297)
> at 
> org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
> {noformat}
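
A minimal sketch of a null-safe fallback that would avoid this NPE; the helper method and the UTF-8 default are illustrative assumptions, not the actual change committed to SegmentReader/Content.

{code:java}
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RecodeSketch {

  /** Null-safe stringification: fall back to UTF-8 when no charset was detected. */
  static String toStringSafe(byte[] rawContent, String detectedCharset) {
    Charset cs = StandardCharsets.UTF_8;
    if (detectedCharset != null) {
      try {
        cs = Charset.forName(detectedCharset);
      } catch (IllegalArgumentException e) {
        // unknown or malformed charset name: keep the UTF-8 fallback
      }
    }
    // A null charset passed into the String constructor is what triggered the NPE above.
    return new String(rawContent, cs);
  }

  public static void main(String[] args) {
    byte[] bytes = "example".getBytes(StandardCharsets.UTF_8);
    System.out.println(toStringSafe(bytes, null));         // no NPE, UTF-8 fallback
    System.out.println(toStringSafe(bytes, "ISO-8859-1")); // detected charset honoured
  }
}
{code}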



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778136#comment-17778136
 ] 

Hudson commented on NUTCH-3011:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #133 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/133/])
NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as 
server errors (HTTP 5xx) (snagel: 
[https://github.com/apache/nutch/commit/b081c75d87be61e42297c952298b72eb7ff2a6dc])
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java


> HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors 
> (HTTP 5xx)
> 
>
> Key: NUTCH-3011
> URL: https://issues.apache.org/jira/browse/NUTCH-3011
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> HttpRobotRulesParser should handle HTTP 429 Too Many Requests same as server 
> errors (HTTP 5xx), that is, if configured, signal the Fetcher to delay requests. 
> See also NUTCH-2573 and 
> https://support.google.com/webmasters/answer/9679690#robots_details
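
A minimal sketch of the status-code handling described above (grouping HTTP 429 with the 5xx range); the method and parameter names are illustrative, not the actual HttpRobotRulesParser API.

{code:java}
public class RobotsStatusSketch {

  /** True when the robots.txt fetch hit a (temporary) server-side problem. */
  static boolean isServerErrorLike(int httpStatus) {
    // HTTP 429 Too Many Requests is grouped with 5xx: the server asks us to back off.
    return httpStatus == 429 || (httpStatus >= 500 && httpStatus < 600);
  }

  static void applyPolicy(int httpStatus, boolean deferVisitsOnServerError) {
    if (isServerErrorLike(httpStatus)) {
      if (deferVisitsOnServerError) {
        System.out.println(httpStatus + ": signal the Fetcher to delay requests to this host");
      } else {
        System.out.println(httpStatus + ": treat robots.txt as temporarily unavailable");
      }
    } else if (httpStatus == 404) {
      System.out.println(httpStatus + ": no robots.txt, allow all");
    } else {
      System.out.println(httpStatus + ": parse robots.txt as usual");
    }
  }

  public static void main(String[] args) {
    applyPolicy(429, true); // now handled the same way as a 503
    applyPolicy(503, true);
    applyPolicy(404, true);
  }
}
{code}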



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-10-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778123#comment-17778123
 ] 

Hudson commented on NUTCH-2990:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #132 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/132/])
NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 
(#779) (github: 
[https://github.com/apache/nutch/commit/ecdd19dbdd4424bf9b9bce206f23992140ee43fe])
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java


> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> ---
>
> Key: NUTCH-2990
> URL: https://issues.apache.org/jira/browse/NUTCH-2990
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol, robots
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The robots.txt parser 
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
>  follows only one redirect when fetching the robots.txt, while the robots.txt 
> RFC 9309 recommends following 5 redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a 
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers 
> SHOULD follow at least five consecutive redirects, even across authorities 
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the 
> robots.txt file MUST be fetched, parsed, and its rules followed in the 
> context of the initial authority. If there are more than five consecutive 
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect 
> location is itself a "/robots.txt" on a different host and then try to read 
> it from the cache.
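
A schematic sketch of the redirect handling described above, assuming a placeholder fetch function; the cache lookup is only indicated by a comment, and this is not the code merged for this issue.

{code:java}
import java.net.MalformedURLException;
import java.net.URL;
import java.util.function.Function;

public class RobotsRedirectSketch {

  // RFC 9309: crawlers SHOULD follow at least five consecutive redirects.
  static final int MAX_REDIRECTS = 5;

  /** Placeholder for an HTTP response: either content or a redirect location. */
  record Response(int status, String redirectLocation, byte[] body) {}

  static byte[] fetchRobotsTxt(URL robotsUrl, Function<URL, Response> fetch)
      throws MalformedURLException {
    URL current = robotsUrl;
    for (int i = 0; i <= MAX_REDIRECTS; i++) {
      Response r = fetch.apply(current);
      if (r.status() >= 300 && r.status() < 400 && r.redirectLocation() != null) {
        URL target = new URL(current, r.redirectLocation());
        if ("/robots.txt".equals(target.getPath())
            && !target.getHost().equalsIgnoreCase(current.getHost())) {
          // Redirect points to /robots.txt of another authority: a cached copy for
          // that host could be reused here instead of re-fetching (cache omitted).
        }
        current = target;
        continue;
      }
      return r.status() == 200 ? r.body() : null;
    }
    // More than five consecutive redirects: the robots.txt MAY be assumed unavailable.
    return null;
  }
}
{code}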



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778124#comment-17778124
 ] 

Hudson commented on NUTCH-3002:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #132 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/132/])
NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should be 
case-insensitive (snagel: 
[https://github.com/apache/nutch/commit/e96cfc56ee04c8e7e07e11d4eef521b4674a9ec6])
* (add) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestResponse.java
* (edit) src/java/org/apache/nutch/net/protocols/Response.java
* (edit) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java
* (add) src/java/org/apache/nutch/metadata/CaseInsensitiveMetadata.java
* (edit) src/java/org/apache/nutch/metadata/Metadata.java
* (add) 
src/plugin/protocol-okhttp/src/test/org/apache/nutch/protocol/okhttp/TestResponse.java
* (edit) src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java


> Protocol-okhttp HttpResponse: HTTP header metadata lookup should be 
> case-insensitive
> 
>
> Key: NUTCH-3002
> URL: https://issues.apache.org/jira/browse/NUTCH-3002
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata, plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Lookup of HTTP headers in the class HttpResponse should be case-insensitive: 
> for example, any "Location" header should be returned regardless of the 
> casing used by the sender.
> While protocol-http uses the class SpellCheckedMetadata, which provides 
> case-insensitive lookups (as part of its spell-checking functionality), 
> protocol-okhttp relies on the class Metadata, which stores metadata values 
> case-sensitively.
> It's a good question whether we still need to spell-check HTTP headers. 
> However, case-insensitive lookups are definitely required, especially since 
> HTTP header names are case-insensitive in HTTP/2.
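
A minimal sketch of case-insensitive header storage, assuming a TreeMap keyed with String.CASE_INSENSITIVE_ORDER; the class below is illustrative, not the CaseInsensitiveMetadata implementation added by the commit.

{code:java}
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveHeadersSketch {

  // Keys compare ignoring case, so "Location", "location" and "LOCATION" map to one entry.
  private final Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  public void add(String name, String value) {
    headers.put(name, value);
  }

  public String get(String name) {
    return headers.get(name);
  }

  public static void main(String[] args) {
    CaseInsensitiveHeadersSketch h = new CaseInsensitiveHeadersSketch();
    h.add("location", "https://example.org/"); // HTTP/2 sends header names in lower case
    System.out.println(h.get("Location"));     // found regardless of casing
  }
}
{code}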



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778125#comment-17778125
 ] 

Hudson commented on NUTCH-3009:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #132 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/132/])
NUTCH-3009 Upgrade to Hadoop 3.3.6 (snagel: 
[https://github.com/apache/nutch/commit/bb68385f9601b37c61ef5a2baac58740c975bddb])
* (edit) ivy/ivy.xml
* (edit) default.properties


> Upgrade to Hadoop 3.3.6
> ---
>
> Key: NUTCH-3009
> URL: https://issues.apache.org/jira/browse/NUTCH-3009
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the 
> latest available release of Hadoop (release date: 2023-06-23).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents [nutch]

2023-10-21 Thread via GitHub


sebastian-nagel merged PR #787:
URL: https://github.com/apache/nutch/pull/787


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778116#comment-17778116
 ] 

ASF GitHub Bot commented on NUTCH-3012:
---

sebastian-nagel merged PR #787:
URL: https://github.com/apache/nutch/pull/787




> SegmentReader when dumping with option -recode: NPE on unparsed documents
> -
>
> Key: NUTCH-3012
> URL: https://issues.apache.org/jira/browse/NUTCH-3012
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> SegmentReader when called with the flag {{-recode}} fails with an NPE when 
> trying to stringify the raw content of unparsed documents:
> {noformat}
> $> bin/nutch readseg  -dump crawl/segments/20231009065431 
> crawl/segreader/20231009065431 -recode
> ...
> 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : 
> attempt_1696825862783_0005_r_00_0, Status : FAILED
> Error: java.lang.NullPointerException: charset
> at java.base/java.lang.String.<init>(String.java:504)
> at java.base/java.lang.String.<init>(String.java:561)
> at org.apache.nutch.protocol.Content.toString(Content.java:297)
> at 
> org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3012.

Resolution: Fixed

> SegmentReader when dumping with option -recode: NPE on unparsed documents
> -
>
> Key: NUTCH-3012
> URL: https://issues.apache.org/jira/browse/NUTCH-3012
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> SegmentReader when called with the flag {{-recode}} fails with an NPE when 
> trying to stringify the raw content of unparsed documents:
> {noformat}
> $> bin/nutch readseg  -dump crawl/segments/20231009065431 
> crawl/segreader/20231009065431 -recode
> ...
> 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : 
> attempt_1696825862783_0005_r_00_0, Status : FAILED
> Error: java.lang.NullPointerException: charset
> at java.base/java.lang.String.<init>(String.java:504)
> at java.base/java.lang.String.<init>(String.java:561)
> at org.apache.nutch.protocol.Content.toString(Content.java:297)
> at 
> org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3011.

Resolution: Implemented

> HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors 
> (HTTP 5xx)
> 
>
> Key: NUTCH-3011
> URL: https://issues.apache.org/jira/browse/NUTCH-3011
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> HttpRobotRulesParser should handle HTTP 429 Too Many Requests same as server 
> errors (HTTP 5xx), that is, if configured, signal the Fetcher to delay requests. 
> See also NUTCH-2573 and 
> https://support.google.com/webmasters/answer/9679690#robots_details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778115#comment-17778115
 ] 

ASF GitHub Bot commented on NUTCH-3011:
---

sebastian-nagel merged PR #786:
URL: https://github.com/apache/nutch/pull/786




> HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors 
> (HTTP 5xx)
> 
>
> Key: NUTCH-3011
> URL: https://issues.apache.org/jira/browse/NUTCH-3011
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> HttpRobotRulesParser should handle HTTP 429 Too Many Requests same as server 
> errors (HTTP 5xx), that is, if configured, signal the Fetcher to delay requests. 
> See also NUTCH-2573 and 
> https://support.google.com/webmasters/answer/9679690#robots_details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) [nutch]

2023-10-21 Thread via GitHub


sebastian-nagel merged PR #786:
URL: https://github.com/apache/nutch/pull/786


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-10-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778108#comment-17778108
 ] 

ASF GitHub Bot commented on NUTCH-2990:
---

sebastian-nagel merged PR #779:
URL: https://github.com/apache/nutch/pull/779




> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> ---
>
> Key: NUTCH-2990
> URL: https://issues.apache.org/jira/browse/NUTCH-2990
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol, robots
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The robots.txt parser 
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
>  follows only one redirect when fetching the robots.txt, while the robots.txt 
> RFC 9309 recommends following 5 redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a 
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers 
> SHOULD follow at least five consecutive redirects, even across authorities 
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the 
> robots.txt file MUST be fetched, parsed, and its rules followed in the 
> context of the initial authority. If there are more than five consecutive 
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect 
> location is itself a "/robots.txt" on a different host and then try to read 
> it from the cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2990.

Resolution: Implemented

Thanks, everybody!

> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> ---
>
> Key: NUTCH-2990
> URL: https://issues.apache.org/jira/browse/NUTCH-2990
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol, robots
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The robots.txt parser 
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
>  follows only one redirect when fetching the robots.txt, while the robots.txt 
> RFC 9309 recommends following 5 redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a 
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers 
> SHOULD follow at least five consecutive redirects, even across authorities 
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the 
> robots.txt file MUST be fetched, parsed, and its rules followed in the 
> context of the initial authority. If there are more than five consecutive 
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect 
> location is itself a "/robots.txt" on a different host and then try to read 
> it from the cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 [nutch]

2023-10-21 Thread via GitHub


sebastian-nagel merged PR #779:
URL: https://github.com/apache/nutch/pull/779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778107#comment-17778107
 ] 

ASF GitHub Bot commented on NUTCH-3009:
---

sebastian-nagel merged PR #782:
URL: https://github.com/apache/nutch/pull/782




> Upgrade to Hadoop 3.3.6
> ---
>
> Key: NUTCH-3009
> URL: https://issues.apache.org/jira/browse/NUTCH-3009
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the 
> latest available release of Hadoop (release date: 2023-06-23).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3009:
--

Assignee: Sebastian Nagel

> Upgrade to Hadoop 3.3.6
> ---
>
> Key: NUTCH-3009
> URL: https://issues.apache.org/jira/browse/NUTCH-3009
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the 
> latest available release of Hadoop (release date: 2023-06-23).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3009.

Resolution: Implemented

> Upgrade to Hadoop 3.3.6
> ---
>
> Key: NUTCH-3009
> URL: https://issues.apache.org/jira/browse/NUTCH-3009
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Upgrade to [Hadoop 3.3.6|https://hadoop.apache.org/release/3.3.6.html], the 
> latest available release of Hadoop (release date: 2023-06-23).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] NUTCH-3009 Upgrade to Hadoop 3.3.6 [nutch]

2023-10-21 Thread via GitHub


sebastian-nagel merged PR #782:
URL: https://github.com/apache/nutch/pull/782


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3006.

Fix Version/s: (was: 1.20)
   Resolution: Abandoned

> Downgrade Tika dependency to 2.2.1 (core and parse-tika)
> 
>
> Key: NUTCH-3006
> URL: https://issues.apache.org/jira/browse/NUTCH-3006
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Major
>
> Tika 2.3.0 and upwards depend on commons-io 2.11.0 (or even higher), which 
> is not available when Nutch is used on Hadoop. Only Hadoop 3.4.0 is expected 
> to ship with commons-io 2.11.0 (HADOOP-18301); all currently released 
> versions provide commons-io 2.8.0. Because Hadoop-required dependencies are 
> enforced in (pseudo)distributed mode, using Tika may cause issues, see 
> NUTCH-2937 and NUTCH-2959.
> [~lewismc] suggested in the discussion of [GitHub PR 
> #776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to 
> resolve these issues for now, until Hadoop 3.4.0 becomes available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3002:
--

Assignee: Sebastian Nagel

> Protocol-okhttp HttpResponse: HTTP header metadata lookup should be 
> case-insensitive
> 
>
> Key: NUTCH-3002
> URL: https://issues.apache.org/jira/browse/NUTCH-3002
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata, plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Lookup of HTTP headers in the class HttpResponse should be case-insensitive: 
> for example, any "Location" header should be returned regardless of the 
> casing used by the sender.
> While protocol-http uses the class SpellCheckedMetadata, which provides 
> case-insensitive lookups (as part of its spell-checking functionality), 
> protocol-okhttp relies on the class Metadata, which stores metadata values 
> case-sensitively.
> It's a good question whether we still need to spell-check HTTP headers. 
> However, case-insensitive lookups are definitely required, especially since 
> HTTP header names are case-insensitive in HTTP/2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3002.

Resolution: Fixed

> Protocol-okhttp HttpResponse: HTTP header metadata lookup should be 
> case-insensitive
> 
>
> Key: NUTCH-3002
> URL: https://issues.apache.org/jira/browse/NUTCH-3002
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata, plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Lookup of HTTP headers in the class HttpResponse should be case-insensitive: 
> for example, any "Location" header should be returned regardless of the 
> casing used by the sender.
> While protocol-http uses the class SpellCheckedMetadata, which provides 
> case-insensitive lookups (as part of its spell-checking functionality), 
> protocol-okhttp relies on the class Metadata, which stores metadata values 
> case-sensitively.
> It's a good question whether we still need to spell-check HTTP headers. 
> However, case-insensitive lookups are definitely required, especially since 
> HTTP header names are case-insensitive in HTTP/2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778105#comment-17778105
 ] 

ASF GitHub Bot commented on NUTCH-3002:
---

sebastian-nagel merged PR #777:
URL: https://github.com/apache/nutch/pull/777




> Protocol-okhttp HttpResponse: HTTP header metadata lookup should be 
> case-insensitive
> 
>
> Key: NUTCH-3002
> URL: https://issues.apache.org/jira/browse/NUTCH-3002
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata, plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Lookup of HTTP headers in the class HttpResponse should be case-insensitive: 
> for example, any "Location" header should be returned regardless of the 
> casing used by the sender.
> While protocol-http uses the class SpellCheckedMetadata, which provides 
> case-insensitive lookups (as part of its spell-checking functionality), 
> protocol-okhttp relies on the class Metadata, which stores metadata values 
> case-sensitively.
> It's a good question whether we still need to spell-check HTTP headers. 
> However, case-insensitive lookups are definitely required, especially since 
> HTTP header names are case-insensitive in HTTP/2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive [nutch]

2023-10-21 Thread via GitHub


sebastian-nagel merged PR #777:
URL: https://github.com/apache/nutch/pull/777


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3014) Standardize NutchJob job names

2023-10-21 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778103#comment-17778103
 ] 

Sebastian Nagel commented on NUTCH-3014:


If there is a single data name/directory (CrawlDb, segment, etc.), using it as 
part of the additional info would make the job name more unique. Imagine a long 
list of generate - fetch - updatedb jobs: adding the segment for the "generator 
partition" and fetcher jobs makes it easier to figure out where in the crawl 
workflow a job was located. If there are multiple workflows running 
concurrently, the CrawlDb name/path would also be a useful discriminating 
component.

> Standardize NutchJob job names
> --
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention:
>  * *Nutch* (mandatory) - a static value prepended to the job name; it 
> distinguishes the job as a NutchJob and makes it easily findable.
>  * *${ClassName}* (mandatory) - the name of the class in which the job is 
> implemented
>  * *${additional info}* (optional) - a value that further distinguishes the 
> type of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}* *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank Inverter_
>  * _Nutch CrawlDb + $crawldb_
>  * _Nutch LinkDbReader + $linkdb_
> Thanks for any suggestions/comments.
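
A minimal sketch of the proposed "Nutch ${ClassName} ${additional info}" convention; the helper below is hypothetical, not an existing Nutch utility.

{code:java}
public class NutchJobNameSketch {

  /** Builds "Nutch ${ClassName} ${additional info}" from a job class and optional details. */
  static String jobName(Class<?> jobClass, String... additionalInfo) {
    StringBuilder name = new StringBuilder("Nutch ").append(jobClass.getSimpleName());
    for (String info : additionalInfo) {
      name.append(' ').append(info);
    }
    return name.toString();
  }

  public static void main(String[] args) {
    // Prints "Nutch NutchJobNameSketch inverter crawl/crawldb"
    System.out.println(jobName(NutchJobNameSketch.class, "inverter", "crawl/crawldb"));
    // In a job class this would replace ad-hoc calls like job.setJobName("read " + segment).
  }
}
{code}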



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3014) Standardize NutchJob job names

2023-10-21 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3014:
---

 Summary: Standardize NutchJob job names
 Key: NUTCH-3014
 URL: https://issues.apache.org/jira/browse/NUTCH-3014
 Project: Nutch
  Issue Type: Improvement
  Components: configuration, runtime
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


There is a large degree of variability when we set the job name:

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention:
 * *Nutch* (mandatory) - a static value prepended to the job name; it 
distinguishes the job as a NutchJob and makes it easily findable.
 * *${ClassName}* (mandatory) - the name of the class in which the job is 
implemented
 * *${additional info}* (optional) - a value that further distinguishes the 
type of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_*Nutch ${ClassName}* *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_

Thanks for any suggestions/comments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)