In debugging some replication issues in our HDFS environment, I noticed that the MapReduce framework uses the following algorithm for setting the replication on submitted job files:
1. Create the file with the *default* DFS replication factor (i.e. 'dfs.replication')
2. Subsequently alter the replication of the file based on the 'mapred.submit.replication' config value

    private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
        Configuration job) throws IOException {
      FSDataOutputStream out = FileSystem.create(fs, splitFile,
          new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
      int replication = job.getInt("mapred.submit.replication", 10);
      fs.setReplication(splitFile, (short) replication);
      writeSplitHeader(out);
      return out;
    }

If I understand correctly, the net functional effect of this approach is that:

- The initial write pipeline is set up with 'dfs.replication' nodes (i.e. 3)
- The namenode triggers additional inter-datanode replications in the background (as it detects the blocks as "under-replicated")

I'm assuming this is intentional? Alternatively, if 'mapred.submit.replication' were specified on the initial create, the write pipeline would be significantly larger.

The reason I noticed is that we had inadvertently specified 'mapred.submit.replication' as *less than* 'dfs.replication' in our configuration, which caused a bunch of excess replica pruning (and ultimately IOExceptions in our datanode logs).

Thanks,
Eric
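For comparison, the single-step alternative I'm describing might look roughly like the sketch below. This is not the shipped MapReduce code, just an illustration: it uses the `FileSystem.create` overload that accepts a replication factor directly, and reuses the same split-file path and permission constant as the snippet above. The buffer-size lookup and `null` progress callback are my assumptions, not anything from the actual framework.

```java
// Hypothetical variant of createFile that passes the submit replication
// straight into the create call, so the initial write pipeline itself
// would span 'mapred.submit.replication' datanodes rather than
// 'dfs.replication' (sketch only).
private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
    Configuration job) throws IOException {
  short replication = (short) job.getInt("mapred.submit.replication", 10);
  FSDataOutputStream out = fs.create(splitFile,
      new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION),
      true,                                     // overwrite if present
      job.getInt("io.file.buffer.size", 4096),  // assumed buffer size lookup
      replication,                              // pipeline width = submit replication
      fs.getDefaultBlockSize(),                 // default block size
      null);                                    // no progress callback
  writeSplitHeader(out);
  return out;
}
```

As noted above, the trade-off is that a pipeline of 10 nodes would put every write on the critical path, whereas the current two-step approach lets the namenode fan out the extra replicas asynchronously.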
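For anyone hitting the same pruning behavior: the configuration invariant we violated was 'mapred.submit.replication' >= 'dfs.replication'. A hedged example of a consistent setup (the values 3 and 10 here are just the defaults mentioned above, not a recommendation):

```xml
<!-- hdfs-site.xml: default replication for newly created files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

<!-- mapred-site.xml: replication applied to submitted job files;
     keep this >= dfs.replication to avoid excess-replica pruning -->
<property>
  <name>mapred.submit.replication</name>
  <value>10</value>
</property>
```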