[ 
https://issues.apache.org/jira/browse/NUTCH-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477378#comment-17477378
 ] 

Hudson commented on NUTCH-2935:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #70 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/70/])
NUTCH-2935 DeduplicationJob: failure on URLs with invalid percent encoding 
(snagel: 
[https://github.com/apache/nutch/commit/d565f45a67d2491b7b536ae95560522aa20b8c26])
* (add) src/testresources/deduplication-crawldb/current/part-r-00000/.index.crc
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (add) src/testresources/deduplication-crawldb/current/part-r-00000/index
* (add) src/test/org/apache/nutch/crawl/TestCrawlDbDeduplication.java
* (add) src/testresources/deduplication-crawldb/current/part-r-00000/.data.crc
* (add) src/testresources/deduplication-crawldb/current/part-r-00000/data


> DeduplicationJob: failure on URLs with invalid percent encoding
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2935
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2935
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> The DeduplicationJob may fail with an IllegalArgumentException on invalid 
> percent encodings in URLs:
> {noformat}
> 2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : 
> attempt_1637669672674_0018_r_000193_0, Status : FAILED
> Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters 
> in escape (%) pattern - Error at index 0 in: "YR"
>         at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>         at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>         at 
> org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
> ...
> Exception in thread "main" java.lang.RuntimeException: Crawl job did not 
> succeed, job status:FAILED, reason: Task failed 
> task_1637669672674_0018_r_000193
> Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 
> killedReduces: 0
> {noformat}
> The IllegalArgumentException should be caught and logged. If only one of the 
> two URLs with duplicated content is invalid, it should be flagged as the 
> duplicate while the valid URL "survives".
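
The behaviour proposed above can be sketched roughly as follows. This is only an illustration, not the committed patch: the class DedupUrlSketch and the methods decodeOrNull and duplicateByValidity are hypothetical names, and the real DeduplicationJob applies further criteria that are not modelled here. The only library call used is java.net.URLDecoder.decode(String, String), which is the call shown in the stack trace.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of Nutch: shows one way to tolerate
// invalid percent encodings when comparing two duplicate URLs.
class DedupUrlSketch {

  /** Returns the decoded URL, or null if the percent encoding is invalid. */
  static String decodeOrNull(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8");
    } catch (IllegalArgumentException | UnsupportedEncodingException e) {
      // Log instead of letting the exception fail the whole reduce task.
      System.err.println("Cannot decode URL " + url + ": " + e.getMessage());
      return null;
    }
  }

  /**
   * Decides by URL validity alone which record to flag as duplicate.
   * Returns 1 if the first URL should be flagged, 2 if the second,
   * and 0 if validity does not decide (both valid or both invalid),
   * in which case other criteria would have to be applied.
   */
  static int duplicateByValidity(String url1, String url2) {
    boolean valid1 = decodeOrNull(url1) != null;
    boolean valid2 = decodeOrNull(url2) != null;
    if (!valid1 && valid2) {
      return 1; // the invalid first URL is the duplicate, the second survives
    }
    if (valid1 && !valid2) {
      return 2; // the invalid second URL is the duplicate, the first survives
    }
    return 0;
  }
}
```

With this sketch, a URL such as "https://example.org/%YR" decodes to null rather than throwing, so a pair consisting of that URL and a well-formed one is resolved in favour of the well-formed URL.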



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
