In debugging some replication issues in our HDFS environment, I noticed that
the MapReduce framework uses the following algorithm for setting the
replication on submitted job files:
1. Create the file with the *default* DFS replication factor (i.e.
'dfs.replication')
2. Subsequently alter the replication of the file based on the
'mapred.submit.replication' config value
private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
    Configuration job) throws IOException {
  FSDataOutputStream out = FileSystem.create(fs, splitFile,
      new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
  int replication = job.getInt("mapred.submit.replication", 10);
  fs.setReplication(splitFile, (short)replication);
  writeSplitHeader(out);
  return out;
}
If I understand correctly, the net functional effect of this approach is that:
- The initial write pipeline is set up with 'dfs.replication' nodes (e.g.
3)
- The namenode triggers additional inter-datanode replications in the
background (as it detects the blocks as "under-replicated").
I'm assuming this is intentional? If 'mapred.submit.replication' were instead
specified on the initial create, the write pipeline would be significantly
larger (10 nodes by default).
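To make the trade-off concrete, here is a small self-contained sketch (plain
Java, no Hadoop dependencies; the class and method names are mine, and the
values 3 and 10 are just the defaults discussed above) comparing the pipeline
size and background replication work under the two approaches:

```java
// Sketch comparing the two strategies for setting job-file replication.
// Assumes dfs.replication = 3 and mapred.submit.replication = 10 (the
// defaults referenced above); names here are illustrative, not Hadoop's.
public class ReplicationSketch {

    // Extra replicas the namenode must schedule in the background when a
    // file is created at createRepl and then raised to targetRepl.
    static int backgroundCopies(int createRepl, int targetRepl) {
        return Math.max(0, targetRepl - createRepl);
    }

    public static void main(String[] args) {
        int dfsReplication = 3;      // 'dfs.replication'
        int submitReplication = 10;  // 'mapred.submit.replication'

        // Current behavior: short pipeline, namenode copies the rest later.
        System.out.println("create-then-setrep: pipeline=" + dfsReplication
            + " backgroundCopies="
            + backgroundCopies(dfsReplication, submitReplication));

        // Alternative: replication passed at create time, so the whole
        // replication target is satisfied synchronously by the pipeline.
        System.out.println("create-at-target:   pipeline=" + submitReplication
            + " backgroundCopies=0");
    }
}
```

The current approach keeps the client's write latency tied to a 3-node
pipeline and pushes the remaining copies into asynchronous inter-datanode
transfers.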
The reason I noticed is that we had inadvertently specified
mapred.submit.replication as *less than* dfs.replication in our configuration,
which caused a bunch of excess replica pruning (and ultimately IOExceptions in
our datanode logs).
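For reference, the problematic configuration looked roughly like this (the
values here are illustrative, not our exact ones): with the submit replication
below 'dfs.replication', every job file is first created at 3 replicas and
then pruned down, generating the excess-replica deletions we saw.

```xml
<!-- mapred-site.xml: submit replication *below* dfs.replication -->
<property>
  <name>mapred.submit.replication</name>
  <value>2</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```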
Thanks,
Eric