[ https://issues.apache.org/jira/browse/NUTCH-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2570.
------------------------------------
    Resolution: Fixed

> Deduplication job fails to install deduplicated CrawlDb
> -------------------------------------------------------
>
>                 Key: NUTCH-2570
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2570
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.15
>
>
> The DeduplicationJob ("nutch dedup") fails to install the deduplicated
> CrawlDb and leaves only the "old" crawldb (if "db.preserve.backup" is true):
> {noformat}
> % tree crawldb
> crawldb
> ├── current
> │   └── part-r-00000
> │       ├── data
> │       └── index
> └── old
>     └── part-r-00000
>         ├── data
>         └── index
> % bin/nutch dedup crawldb
> DeduplicationJob: starting at 2018-04-22 21:48:08
> Deduplication: 6 documents marked as duplicates
> Deduplication: Updating status of duplicate urls into crawl db.
> Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/crawldb/1742327020 does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
>         at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:374)
>         at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:613)
>         at org.apache.nutch.util.FSUtils.replace(FSUtils.java:58)
>         at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:212)
>         at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:225)
>         at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:366)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:379)
> % tree crawldb
> crawldb
> └── old
>     └── part-r-00000
>         ├── data
>         └── index
> {noformat}
> In pseudo-distributed mode it's even worse: only the "old" CrawlDb is left
> without any error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)