[ https://issues.apache.org/jira/browse/NUTCH-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel reassigned NUTCH-2279:
--------------------------------------

    Assignee: Sebastian Nagel

> LinkRank fails when using Hadoop MR output compression
> ------------------------------------------------------
>
>                 Key: NUTCH-2279
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2279
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.12
>            Reporter: Joseph Naegele
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> When using MapReduce job output compression, i.e.
> {{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the
> results of its {{Counter}} MR job due to the additional, generated file
> extension.
> For example, using the default compression codec (which appears to be
> DEFLATE), the counter file is written to
> {{crawl/webgraph/_num_nodes_/part-00000.deflate}}. Then, the LinkRank job
> attempts to manually read this file to obtain the number of links using the
> following code:
> {code}
> FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
> {code}
> which fails because the file {{part-00000}} doesn't exist:
> {code}
> LinkAnalysis: java.io.FileNotFoundException: File crawl/webgraph/_num_nodes_/part-00000 does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>         at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
>         at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
>         at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
> {code}
> To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to
> the properties for {{bin/nutch linkrank ...}}
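
One possible way to make the read tolerant of the extra extension is sketched below. This is not the fix committed for NUTCH-2279, only an illustration of the idea: look up the part file by prefix instead of by exact name, then decompress based on the extension. To stay runnable without Hadoop on the classpath it uses java.nio and java.util.zip as stand-ins; inside Hadoop, {{CompressionCodecFactory.getCodec(Path)}} would be the natural way to pick a codec from the file extension. The class name {{PartFileReader}}, its methods, and the assumed record layout (one line whose last whitespace-separated token is the count) are all hypothetical, and the sketch assumes a {{.deflate}} file is a zlib stream.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper; not part of Nutch or Hadoop. */
final class PartFileReader {

  /** Returns "baseName" or "baseName.<ext>" inside dir, whichever exists. */
  public static Path findPartFile(Path dir, String baseName) throws IOException {
    // The glob is matched against file names, so hidden ".part-00000.crc"
    // checksum files (leading dot) are not picked up.
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, baseName + "*")) {
      for (Path p : stream) {
        String name = p.getFileName().toString();
        if (name.equals(baseName) || name.startsWith(baseName + ".")) {
          return p;
        }
      }
    }
    throw new FileNotFoundException("No " + baseName + "[.ext] file in " + dir);
  }

  /** Opens the file, inflating it when the name ends with ".deflate". */
  public static InputStream open(Path file) throws IOException {
    InputStream raw = Files.newInputStream(file);
    // Assumption: ".deflate" output is a zlib stream, which is what
    // InflaterInputStream reads by default. Other codecs are not handled.
    return file.getFileName().toString().endsWith(".deflate")
        ? new InflaterInputStream(raw)
        : raw;
  }

  /** Reads the count from the first line of the part file (assumed layout). */
  public static long readCount(Path dir) throws IOException {
    Path part = findPartFile(dir, "part-00000");
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(open(part), StandardCharsets.UTF_8))) {
      String line = reader.readLine();
      if (line == null) {
        throw new IOException("Empty counter file: " + part);
      }
      // Assumed layout: the count is the last whitespace-separated token.
      String[] tokens = line.trim().split("\\s+");
      return Long.parseLong(tokens[tokens.length - 1]);
    }
  }
}
```

With this shape, a call such as {{PartFileReader.readCount(dir)}} behaves the same whether the directory holds {{part-00000}} or {{part-00000.deflate}}, and it still raises {{FileNotFoundException}} when neither is present. Other codecs (gzip, bzip2, ...) would need their own branches, which is the part a codec factory keyed on the extension would take over.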