In debugging some replication issues in our HDFS environment, I noticed that 
the MapReduce framework uses the following algorithm for setting the 
replication on submitted job files:

1. Create the file with the *default* DFS replication factor (i.e.
   'dfs.replication').

2. Subsequently alter the file's replication based on the
   'mapred.submit.replication' config value.

  private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
      Configuration job) throws IOException {
    // Step 1: create with the default replication ('dfs.replication')
    FSDataOutputStream out = FileSystem.create(fs, splitFile,
        new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
    // Step 2: change the replication target after the fact
    int replication = job.getInt("mapred.submit.replication", 10);
    fs.setReplication(splitFile, (short) replication);
    writeSplitHeader(out);
    return out;
  }

If I understand correctly, the net functional effect of this approach is that:

- The initial write pipeline is set up with 'dfs.replication' nodes (i.e.
  3).

- The namenode triggers additional inter-datanode replications in the
  background (as it detects the blocks as "under-replicated").

I'm assuming this is intentional?  Alternatively, if
mapred.submit.replication were specified on the initial create, the write
pipeline would be significantly larger (e.g. 10 nodes instead of 3, given
the default of 10).
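To make the interaction between the two settings concrete, here is a toy sketch (plain Java, no Hadoop dependency; the class and method names are mine for illustration, not Hadoop API) of the pipeline width at write time versus the replica delta the namenode has to reconcile afterwards:

```java
// Toy model of the two-step create-then-setReplication sequence.
public class SubmitReplicationSketch {

    // Width of the initial write pipeline: create() only knows about
    // the default 'dfs.replication'.
    static int pipelineWidth(int dfsReplication) {
        return dfsReplication;
    }

    // Replicas the namenode must add (positive) or prune (negative)
    // per block once setReplication() changes the target.
    static int backgroundDelta(int dfsReplication, int submitReplication) {
        return submitReplication - dfsReplication;
    }

    public static void main(String[] args) {
        // Typical defaults: dfs.replication=3, mapred.submit.replication=10
        System.out.println(pipelineWidth(3));        // 3-node write pipeline
        System.out.println(backgroundDelta(3, 10));  // 7 extra copies scheduled per block

        // Our misconfiguration: submit replication *below* the default
        System.out.println(backgroundDelta(3, 2));   // -1: one excess replica pruned per block
    }
}
```

The second case is the one we hit: a negative delta means every block of every submitted job file is briefly over-replicated and then pruned.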

The reason I noticed is that we had inadvertently specified 
mapred.submit.replication as *less than* dfs.replication in our configuration, 
which caused a bunch of excess replica pruning (and ultimately IOExceptions in 
our datanode logs).

Thanks,
Eric
