[ https://issues.apache.org/jira/browse/NUTCH-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477378#comment-17477378 ]
Hudson commented on NUTCH-2935:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #70 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/70/])
NUTCH-2935 DeduplicationJob: failure on URLs with invalid percent encoding (snagel: [https://github.com/apache/nutch/commit/d565f45a67d2491b7b536ae95560522aa20b8c26])
* (add) src/testresources/deduplication-crawldb/current/part-r-00000/.index.crc
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (add) src/testresources/deduplication-crawldb/current/part-r-00000/index
* (add) src/test/org/apache/nutch/crawl/TestCrawlDbDeduplication.java
* (add) src/testresources/deduplication-crawldb/current/part-r-00000/.data.crc
* (add) src/testresources/deduplication-crawldb/current/part-r-00000/data

> DeduplicationJob: failure on URLs with invalid percent encoding
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2935
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2935
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>   Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> The DeduplicationJob may fail with an IllegalArgumentException on invalid
> percent encodings in URLs:
> {noformat}
> 2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : attempt_1637669672674_0018_r_000193_0, Status : FAILED
> Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "YR"
>         at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>         at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>         at org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
> ...
> Exception in thread "main" java.lang.RuntimeException: Crawl job did not succeed, job status:FAILED, reason: Task failed task_1637669672674_0018_r_000193
> Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0
> {noformat}
> The IllegalArgumentException should be caught, logged and if only one of the
> two URLs with duplicated content is invalid, it should be flagged as
> duplicate while the valid URL "survives".

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
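
For illustration only, the handling proposed in the issue description could be sketched roughly as follows. This is not the code from the linked commit: the class `DedupDecodeSketch` and the methods `decodeOrNull` and `pickDuplicateByValidity` are hypothetical names, and the logging is reduced to `System.err`. The sketch assumes that an undecodable URL is simply treated as invalid, and that the case where both URLs are valid (or both invalid) is left to whatever other criteria the job applies.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical sketch, not the actual Nutch DeduplicationJob code. */
public class DedupDecodeSketch {

  /**
   * Decodes a percent-encoded URL. Returns null instead of throwing
   * when the URL contains an invalid escape sequence such as "%YR".
   */
  public static String decodeOrNull(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8");
    } catch (IllegalArgumentException | UnsupportedEncodingException e) {
      // Catch and log rather than failing the whole reduce task.
      System.err.println("Cannot decode URL " + url + ": " + e.getMessage());
      return null;
    }
  }

  /**
   * Given two URLs with duplicated content, returns the one to flag as
   * duplicate if exactly one of them is invalid, so the valid URL survives.
   * Returns null if validity alone does not decide (both valid or both invalid).
   */
  public static String pickDuplicateByValidity(String urlA, String urlB) {
    boolean validA = decodeOrNull(urlA) != null;
    boolean validB = decodeOrNull(urlB) != null;
    if (validA && !validB) {
      return urlB;
    }
    if (!validA && validB) {
      return urlA;
    }
    return null; // undecided: fall through to other deduplication criteria
  }

  public static void main(String[] args) {
    String valid = "http://example.com/a%20b";
    String invalid = "http://example.com/%YR";
    System.out.println("Flag as duplicate: " + pickDuplicateByValidity(valid, invalid));
  }
}
```

Returning null for an undecodable URL keeps the exception local to the decoding step, which is the point of the proposal: a single malformed URL no longer fails the task.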