Matt Parker created MAPREDUCE-5050:
--------------------------------------

             Summary: Cannot find partition.lst in Terasort on Hadoop/Local File System
                 Key: MAPREDUCE-5050
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5050
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: examples
    Affects Versions: 0.20.2
         Environment: Cloudera VM CDH3u4, VMWare, Linux, Java SE 1.6.0_31-b04
            Reporter: Matt Parker
            Priority: Minor


I'm trying to simulate running Hadoop on Lustre by configuring Hadoop to use 
the local file system on a single Cloudera VM (CDH3u4).

I can generate the data just fine, but the sorting portion of the program 
fails with an error saying it cannot find the _partition.lst file, even though 
that file exists in the generated data directory.

Perusing the TeraSort code, I see that the run method builds the Path to 
_partition.lst with the input directory as its parent (lines marked >>).

  public int run(String[] args) throws Exception {
      LOG.info("starting");
      JobConf job = (JobConf) getConf();
>>  Path inputDir = new Path(args[0]);
>>  inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
>>  Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
      URI partitionUri = new URI(partitionFile.toString() +
                               "#" + TeraInputFormat.PARTITION_FILENAME);
      TeraInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setJobName("TeraSort");
      job.setJarByClass(TeraSort.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setInputFormat(TeraInputFormat.class);
      job.setOutputFormat(TeraOutputFormat.class);
      job.setPartitionerClass(TotalOrderPartitioner.class);
      TeraInputFormat.writePartitionFile(job, partitionFile);
      DistributedCache.addCacheFile(partitionUri, job);
      DistributedCache.createSymlink(job);
      job.setInt("dfs.replication", 1);
      TeraOutputFormat.setFinalSync(job, true);
      JobClient.runJob(job);
      LOG.info("done");
      return 0;
  }
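
For context, the "#" suffix that run() appends to partitionUri is the part of 
the cache URI that names the symlink the distributed cache is expected to 
create in each task's working directory. The sketch below uses only 
java.net.URI to show how such a URI splits into a file path and a link name. 
The class name PartitionUriSketch and the input directory are made up for 
illustration, and the file name assumes PARTITION_FILENAME is "_partition.lst" 
as in the error; none of this is Hadoop code.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PartitionUriSketch {
    // Assumption: same value as TeraInputFormat.PARTITION_FILENAME.
    static final String PARTITION_FILENAME = "_partition.lst";

    // Builds a URI shaped like the one in run(): <inputDir>/<file>#<file>.
    static URI cacheUri(String qualifiedInputDir) throws URISyntaxException {
        String partitionFile = qualifiedInputDir + "/" + PARTITION_FILENAME;
        return new URI(partitionFile + "#" + PARTITION_FILENAME);
    }

    public static void main(String[] args) throws URISyntaxException {
        // Hypothetical input directory on the local file system.
        URI uri = cacheUri("file:/data/terasort-input");
        // The path is where the file really lives.
        System.out.println("path     = " + uri.getPath());
        // The fragment is the requested symlink name.
        System.out.println("fragment = " + uri.getFragment());
    }
}
```

If that symlink is never created in the task's working directory, a lookup by 
bare file name has nothing to find, which would match the error described here.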

But in the partitioner's configure method, the Path is created from the bare 
file name, without the parent directory.

    public void configure(JobConf job) {

      try {
        FileSystem fs = FileSystem.getLocal(job);
>>    Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }

    }
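
The difference between the two Path constructions can be illustrated without 
Hadoop. In the sketch below, java.nio.file.Path stands in for 
org.apache.hadoop.fs.Path (an assumption made so the example runs on its own), 
and the directory names are hypothetical. A bare file name resolves against 
whatever the working directory happens to be, while a name with a parent 
resolves to the same place regardless.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelativePathSketch {
    // Assumption: same value as TeraInputFormat.PARTITION_FILENAME.
    static final String PARTITION_FILENAME = "_partition.lst";

    // Mirrors configure(): a bare name, so the result depends on workingDir.
    static Path resolveBare(Path workingDir) {
        return workingDir.resolve(PARTITION_FILENAME);
    }

    // Mirrors run(): the input directory is given as the parent.
    static Path resolveWithParent(Path inputDir) {
        return inputDir.resolve(PARTITION_FILENAME);
    }

    public static void main(String[] args) {
        Path taskDir = Paths.get("/tmp/task-work");        // hypothetical
        Path inputDir = Paths.get("/data/terasort-input"); // hypothetical
        System.out.println("bare name:   " + resolveBare(taskDir));
        System.out.println("with parent: " + resolveWithParent(inputDir));
    }
}
```

The bare-name form only works when something has already placed 
_partition.lst in the working directory.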

I modified the code as follows, and now the sorting portion of the TeraSort 
test works using the general file system. I think the original configure code 
above is a bug.

    public void configure(JobConf job) {

      try {
        FileSystem fs = FileSystem.getLocal(job);

>>      Path[] inputPaths = TeraInputFormat.getInputPaths(job);
>>      Path partFile = new Path(inputPaths[0],
>>                               TeraInputFormat.PARTITION_FILENAME);

        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }

    }
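
One caveat with the change above: it still reads through 
FileSystem.getLocal(job), so it presumably only works when the input directory 
itself is on the local file system. A possible variation, sketched here under 
that caveat, is to keep using the symlinked copy when it exists and fall back 
to the input directory only when it does not. The sketch uses java.nio.file 
instead of the Hadoop FileSystem API so that it runs on its own; the class 
name PartitionLookupSketch and the method locatePartitionFile are hypothetical 
and not part of TeraSort.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PartitionLookupSketch {
    // Assumption: same value as TeraInputFormat.PARTITION_FILENAME.
    static final String PARTITION_FILENAME = "_partition.lst";

    // Prefer the copy in workingDir; otherwise use the one under inputDir.
    static Path locatePartitionFile(Path workingDir, Path inputDir)
            throws IOException {
        Path cached = workingDir.resolve(PARTITION_FILENAME);
        if (Files.exists(cached)) {
            return cached;
        }
        Path fallback = inputDir.resolve(PARTITION_FILENAME);
        if (Files.exists(fallback)) {
            return fallback;
        }
        throw new IOException("can't find " + PARTITION_FILENAME
            + " in " + workingDir + " or " + inputDir);
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories play the task dir and the input dir.
        Path work = Files.createTempDirectory("task-work");
        Path input = Files.createTempDirectory("tera-input");
        Files.createFile(input.resolve(PARTITION_FILENAME));
        // No symlink in the task dir, so the input-dir copy is chosen.
        System.out.println(locatePartitionFile(work, input));
    }
}
```

This keeps the existing behavior on clusters where the symlink is present and 
only changes what happens when it is missing.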


