[ https://issues.apache.org/jira/browse/NUTCH-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2935.
------------------------------------
    Resolution: Fixed

> DeduplicationJob: failure on URLs with invalid percent encoding
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2935
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2935
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> The DeduplicationJob may fail with an IllegalArgumentException on invalid
> percent encodings in URLs:
> {noformat}
> 2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : attempt_1637669672674_0018_r_000193_0, Status : FAILED
> Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "YR"
>         at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>         at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>         at org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
> ...
> Exception in thread "main" java.lang.RuntimeException: Crawl job did not succeed, job status:FAILED, reason: Task failed task_1637669672674_0018_r_000193
> Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0
> {noformat}
> The IllegalArgumentException should be caught, logged and if only one of the
> two URLs with duplicated content is invalid, it should be flagged as
> duplicate while the valid URL "survives".

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
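
For illustration only, the handling requested in the issue description could be sketched as follows. This is not the committed Nutch patch: the class DedupDecodeSketch, the helpers decodeOrNull and pickDuplicate, the logging to stderr, and the length-based tie-break are all assumptions made for this example. Only java.net.URLDecoder and the IllegalArgumentException it throws come from the report.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch, not the actual DeduplicationJob code.
class DedupDecodeSketch {

  /** Returns the decoded URL, or null if its percent encoding is invalid. */
  static String decodeOrNull(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8");
    } catch (IllegalArgumentException e) {
      // Catch and log instead of letting the reduce task fail.
      System.err.println("Cannot decode URL " + url + ": " + e.getMessage());
      return null;
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is a required charset on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** Returns the URL to flag as duplicate; the other one "survives". */
  static String pickDuplicate(String urlA, String urlB) {
    String a = decodeOrNull(urlA);
    String b = decodeOrNull(urlB);
    if (a == null && b != null) {
      return urlA; // only urlA is invalid: flag it
    }
    if (b == null && a != null) {
      return urlB; // only urlB is invalid: flag it
    }
    // Both valid or both invalid. Assumed tie-break for this sketch:
    // flag the longer URL, comparing raw strings if decoding failed.
    int lenA = (a != null) ? a.length() : urlA.length();
    int lenB = (b != null) ? b.length() : urlB.length();
    return (lenA > lenB) ? urlA : urlB;
  }

  public static void main(String[] args) {
    // The first URL contains the invalid escape "%YR" from the report.
    System.out.println(pickDuplicate("http://example.com/a%YRb",
        "http://example.com/ab"));
  }
}
```

With one invalid and one valid URL, the invalid one is returned as the duplicate regardless of argument order, which matches the behaviour asked for above. The tie-break for the remaining cases is only a placeholder for whatever criteria the real job applies.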