Why should this lead to an IOException? Is it because the pruning of excess replicas is asynchronous and the datanodes try to access nonexistent files? If so, that seems like a pretty major bug.
On Fri, Aug 16, 2013 at 5:21 PM, Chris Nauroth <cnaur...@hortonworks.com> wrote:

> Hi Eric,
>
> Yes, this is intentional. The job.xml file and the job jar file get read
> from every node running a map or reduce task. Because of this, using a
> higher-than-normal replication factor on these files improves locality:
> more than 3 task slots will have access to local replicas. These files
> tend to be much smaller than the actual data read by a job, so there tends
> to be little harm done in terms of disk space consumption.
>
> Why not create the file initially with 10 replicas instead of creating it
> with 3 and then dialing up? I imagine this is so that job submission
> doesn't block on a synchronous write to a long pipeline. The extra
> replicas aren't necessary for correctness, and a long-running job will get
> the locality benefits in the long term once more replicas are created in
> the background.
>
> I recommend submitting a new jira describing the problem that you saw. We
> probably can handle this better, and a jira would be a good place to
> discuss the trade-offs. A few possibilities:
>
> - Log a warning if mapred.submit.replication < dfs.replication.
> - Skip resetting replication if mapred.submit.replication <= dfs.replication.
> - Fail with an error if mapred.submit.replication < dfs.replication.
>
> Chris Nauroth
> Hortonworks
> http://hortonworks.com/
>
>
> On Thu, Aug 15, 2013 at 6:21 AM, Sirianni, Eric <eric.siria...@netapp.com> wrote:
>
>> In debugging some replication issues in our HDFS environment, I noticed
>> that the MapReduce framework uses the following algorithm for setting the
>> replication on submitted job files:
>>
>> 1. Create the file with the *default* DFS replication factor (i.e.
>>    'dfs.replication').
>> 2. Subsequently alter the replication of the file based on the
>>    'mapred.submit.replication' config value.
>>
>>     private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
>>         Configuration job) throws IOException {
>>       FSDataOutputStream out = FileSystem.create(fs, splitFile,
>>           new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
>>       int replication = job.getInt("mapred.submit.replication", 10);
>>       fs.setReplication(splitFile, (short) replication);
>>       writeSplitHeader(out);
>>       return out;
>>     }
>>
>> If I understand correctly, the net functional effect of this approach is
>> that:
>>
>> - The initial write pipeline is set up with 'dfs.replication' nodes (i.e. 3).
>> - The namenode triggers additional inter-datanode replications in the
>>   background (as it detects the blocks as "under-replicated").
>>
>> I'm assuming this is intentional? Alternatively, if
>> mapred.submit.replication were specified on the initial create, the write
>> pipeline would be significantly larger.
>>
>> The reason I noticed is that we had inadvertently specified
>> mapred.submit.replication as *less than* dfs.replication in our
>> configuration, which caused a bunch of excess replica pruning (and
>> ultimately IOExceptions in our datanode logs).
>>
>> Thanks,
>> Eric

--
Jay Vyas
http://jayunit100.blogspot.com
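For what it's worth, the second of Chris's suggestions (skip resetting replication when it would go below the cluster default) could be sketched roughly as below. This is a hypothetical illustration only, not actual Hadoop code: the class and method names are made up, and the config values are modelled as plain ints rather than a real Configuration object. The idea is that the submit-time replication is only ever used to raise the replication factor, so a misconfigured mapred.submit.replication can never trigger excess-replica pruning.

```java
public class SubmitReplicationGuard {

    /**
     * Returns the replication factor that should actually be applied to a
     * submitted job file. If mapred.submit.replication <= dfs.replication,
     * the cluster default is kept, so setReplication() never lowers the
     * count and the namenode never has to prune excess replicas.
     */
    static short effectiveSubmitReplication(short submitReplication,
                                            short dfsReplication) {
        if (submitReplication <= dfsReplication) {
            // Skip the reset: dialing replication down buys nothing for
            // locality and only causes background replica deletion.
            return dfsReplication;
        }
        return submitReplication;
    }

    public static void main(String[] args) {
        // Normal case: default of 10 submit replicas vs. dfs.replication=3.
        System.out.println(effectiveSubmitReplication((short) 10, (short) 3));
        // Misconfigured case from the original report: submit < default.
        System.out.println(effectiveSubmitReplication((short) 2, (short) 3));
    }
}
```

With this guard in place, the misconfiguration Eric describes (mapred.submit.replication=2 on a cluster with dfs.replication=3) would silently keep 3 replicas instead of pruning down to 2; whether a warning or a hard failure is preferable is exactly the trade-off a jira could settle.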