Hi Eric,

Yes, this is intentional. The job.xml file and the job jar file get read from every node running a map or reduce task. Because of this, using a higher than normal replication factor on these files improves locality: with more replicas than the default of 3, tasks on more nodes can read the files from a local replica. These files tend to be much smaller than the actual data read by a job, so there tends to be little harm done in terms of disk space consumption.

Why not create the file initially with 10 replicas instead of creating it with 3 and then dialing up? I imagine this is so that job submission doesn't block on a synchronous write to a long pipeline. The extra replicas aren't necessary for correctness, and a long-running job will get the locality benefits in the long term once more replicas are created in the background.

I recommend submitting a new jira describing the problem that you saw. We probably can handle this better, and a jira would be a good place to discuss the trade-offs. A few possibilities:

- Log a warning if mapred.submit.replication < dfs.replication.
- Skip resetting replication if mapred.submit.replication <= dfs.replication.
- Fail with error if mapred.submit.replication < dfs.replication.

Chris Nauroth
Hortonworks
http://hortonworks.com/

On Thu, Aug 15, 2013 at 6:21 AM, Sirianni, Eric <eric.siria...@netapp.com> wrote:

> In debugging some replication issues in our HDFS environment, I noticed
> that the MapReduce framework uses the following algorithm for setting the
> replication on submitted job files:
>
> 1. Create the file with *default* DFS replication factor (i.e.
>    'dfs.replication')
>
> 2. Subsequently alter the replication of the file based on the
>    'mapred.submit.replication' config value
>
>   private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
>       Configuration job) throws IOException {
>     FSDataOutputStream out = FileSystem.create(fs, splitFile,
>         new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
>     int replication = job.getInt("mapred.submit.replication", 10);
>     fs.setReplication(splitFile, (short)replication);
>     writeSplitHeader(out);
>     return out;
>   }
>
> If I understand correctly, the net functional effect of this approach is
> that
>
> - The initial write pipeline is set up with 'dfs.replication' nodes
>   (i.e. 3)
>
> - The namenode triggers additional inter-datanode replications in
>   the background (as it detects the blocks as "under-replicated").
>
> I'm assuming this is intentional? Alternatively, if the
> mapred.submit.replication was specified on initial create, the write
> pipeline would be significantly larger.
>
> The reason I noticed is that we had inadvertently specified
> mapred.submit.replication as *less than* dfs.replication in our
> configuration, which caused a bunch of excess replica pruning (and
> ultimately IOExceptions in our datanode logs).
>
> Thanks,
> Eric
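
For discussion purposes, here is a rough sketch of how the three possibilities listed in the reply (warn, skip, fail) might look as a single check. It is illustrative only: SubmitReplicationPolicy, Mode, resolve and NO_CHANGE are hypothetical names, not existing Hadoop APIs, and the sketch deliberately avoids any Hadoop dependency so that it compiles on its own.

```java
import java.util.logging.Logger;

/**
 * Hypothetical helper (not part of Hadoop): decides which replication factor,
 * if any, to request for job submission files, given the values of
 * mapred.submit.replication and dfs.replication.
 */
final class SubmitReplicationPolicy {
    private static final Logger LOG =
        Logger.getLogger(SubmitReplicationPolicy.class.getName());

    /** The three options floated in this thread. */
    enum Mode { WARN, SKIP, FAIL }

    /** Sentinel: leave the replication chosen at create time untouched. */
    static final int NO_CHANGE = -1;

    private SubmitReplicationPolicy() { }

    /**
     * Returns the replication to apply after the file is created,
     * or NO_CHANGE if the replication should not be reset.
     */
    static int resolve(int submitReplication, int dfsReplication, Mode mode) {
        switch (mode) {
            case SKIP:
                // Skip resetting if mapred.submit.replication <= dfs.replication.
                return submitReplication <= dfsReplication
                    ? NO_CHANGE
                    : submitReplication;
            case FAIL:
                // Fail with an error if mapred.submit.replication < dfs.replication.
                if (submitReplication < dfsReplication) {
                    throw new IllegalArgumentException(
                        "mapred.submit.replication (" + submitReplication
                        + ") is less than dfs.replication (" + dfsReplication + ")");
                }
                return submitReplication;
            case WARN:
            default:
                // Log a warning, but still apply the configured value.
                if (submitReplication < dfsReplication) {
                    LOG.warning("mapred.submit.replication (" + submitReplication
                        + ") is less than dfs.replication (" + dfsReplication
                        + "); excess replicas will be pruned");
                }
                return submitReplication;
        }
    }
}
```

A caller such as createFile could then invoke fs.setReplication(splitFile, (short) r) only when the returned value r is not NO_CHANGE. Which of the three modes is appropriate is exactly the trade-off the jira would need to settle.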