[ 
https://issues.apache.org/jira/browse/NUTCH-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2935.
------------------------------------
    Resolution: Fixed

> DeduplicationJob: failure on URLs with invalid percent encoding
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2935
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2935
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> The DeduplicationJob may fail with an IllegalArgumentException on invalid 
> percent encodings in URLs:
> {noformat}
> 2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : attempt_1637669672674_0018_r_000193_0, Status : FAILED
> Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "YR"
>         at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>         at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>         at org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
> ...
> Exception in thread "main" java.lang.RuntimeException: Crawl job did not succeed, job status:FAILED, reason: Task failed task_1637669672674_0018_r_000193
> Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0
> {noformat}
> The IllegalArgumentException should be caught and logged. If only one of 
> the two URLs with duplicated content has an invalid percent encoding, that 
> URL should be flagged as the duplicate so that the valid URL "survives".
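
The behaviour described above could be sketched roughly as follows. This is
an illustration only, not the committed Nutch patch: the class
DedupDecodeSketch, the helpers decodeOrNull and pickDuplicate, and the
"longer URL is the duplicate" tie-break are all hypothetical names and
assumptions. Only java.net.URLDecoder and the IllegalArgumentException it
throws come from the stack trace. The sketch assumes Java 10 or later for
URLDecoder.decode(String, Charset).

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the actual DeduplicationJob code.
class DedupDecodeSketch {

    /** Returns the decoded URL, or null if its percent encoding is invalid. */
    public static String decodeOrNull(String url) {
        try {
            return URLDecoder.decode(url, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            // Catch and log instead of letting the reduce task fail.
            System.err.println("Invalid percent encoding in " + url
                + ": " + e.getMessage());
            return null;
        }
    }

    /** Returns the URL of the pair that should be flagged as duplicate. */
    public static String pickDuplicate(String urlA, String urlB) {
        String decodedA = decodeOrNull(urlA);
        String decodedB = decodeOrNull(urlB);

        // Exactly one URL is invalid: flag it so the valid one survives.
        if (decodedA == null && decodedB != null) return urlA;
        if (decodedB == null && decodedA != null) return urlB;

        // Both valid or both invalid. Assumed tie-break for illustration:
        // flag the longer URL, using the raw string where decoding failed.
        String a = decodedA != null ? decodedA : urlA;
        String b = decodedB != null ? decodedB : urlB;
        return a.length() > b.length() ? urlA : urlB;
    }

    public static void main(String[] args) {
        String invalid = "http://example.com/a%YRb";
        String valid = "http://example.com/a-longer-but-valid-path";
        System.out.println("Flag as duplicate: " + pickDuplicate(invalid, valid));
    }
}
```

In this sketch the invalid URL is flagged even though it is the shorter of
the two, because the validity check runs before the length comparison.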



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
