Matt Parker created MAPREDUCE-5050:
--------------------------------------

                 Summary: Cannot find partition.lst in Terasort on Hadoop/Local File System
                     Key: MAPREDUCE-5050
                     URL: https://issues.apache.org/jira/browse/MAPREDUCE-5050
                 Project: Hadoop Map/Reduce
              Issue Type: Bug
              Components: examples
        Affects Versions: 0.20.2
             Environment: Cloudera VM CDH3u4, VMWare, Linux, Java SE 1.6.0_31-b04
                Reporter: Matt Parker
                Priority: Minor
I'm trying to simulate running Hadoop on Lustre by configuring it to use the local file system on a single Cloudera VM (cdh3u4). I can generate the data just fine, but when running the sorting portion of the program, I get an error about not being able to find the _partition.lst file, even though it exists in the generated data directory. Perusing the TeraSort code, I see that the run method creates its Path reference to partition.lst with the parent (input) directory:

    public int run(String[] args) throws Exception {
      LOG.info("starting");
      JobConf job = (JobConf) getConf();
>>    Path inputDir = new Path(args[0]);
>>    inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
>>    Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
      URI partitionUri = new URI(partitionFile.toString() + "#" +
                                 TeraInputFormat.PARTITION_FILENAME);
      TeraInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setJobName("TeraSort");
      job.setJarByClass(TeraSort.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setInputFormat(TeraInputFormat.class);
      job.setOutputFormat(TeraOutputFormat.class);
      job.setPartitionerClass(TotalOrderPartitioner.class);
      TeraInputFormat.writePartitionFile(job, partitionFile);
      DistributedCache.addCacheFile(partitionUri, job);
      DistributedCache.createSymlink(job);
      job.setInt("dfs.replication", 1);
      TeraOutputFormat.setFinalSync(job, true);
      JobClient.runJob(job);
      LOG.info("done");
      return 0;
    }

But in the configure method, the Path isn't created with the parent directory reference.
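The distinction matters because a bare relative file name resolves against whatever the task's current working directory happens to be; it presumably works on HDFS only because DistributedCache.createSymlink links the cached _partition.lst into that working directory, a step that does not appear to happen when everything runs against the local file system. A minimal plain-Java sketch of the two resolution styles (using java.nio.file rather than Hadoop's org.apache.hadoop.fs.Path, with hypothetical directory names):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PartitionPathDemo {
    public static void main(String[] args) {
        // Hypothetical names, for illustration only.
        String partitionName = "_partition.lst";
        Path inputDir = Paths.get("/data/terasort-in");

        // configure()-style: a bare relative name. Where it points depends
        // entirely on the process's current working directory.
        Path bare = Paths.get(partitionName);
        System.out.println(bare.isAbsolute());   // false

        // run()-style: anchored in the input directory, so it names the
        // same file regardless of where the task is running.
        Path anchored = inputDir.resolve(partitionName);
        System.out.println(anchored);            // /data/terasort-in/_partition.lst
    }
}
```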
    public void configure(JobConf job) {
      try {
        FileSystem fs = FileSystem.getLocal(job);
>>      Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }
    }

I think the above code is a bug. I modified it as follows, and now the sorting portion of the TeraSort test works using the general file system:

    public void configure(JobConf job) {
      try {
        FileSystem fs = FileSystem.getLocal(job);
>>      Path[] inputPaths = TeraInputFormat.getInputPaths(job);
>>      Path partFile = new Path(inputPaths[0], TeraInputFormat.PARTITION_FILENAME);
        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }
    }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira