Hi Eric,

Yes, this is intentional.  The job.xml file and the job jar file get read
from every node that runs a map or reduce task.  Because of this, using a
higher-than-normal replication factor on these files improves locality:
far more task slots end up with a local replica than the default of 3
would provide.  These files also tend to be much smaller than the actual
data read by a job, so the extra replicas cost little in disk space.

Why not create the file initially with 10 replicas instead of creating it
with 3 and then dialing up?  I imagine this is so that job submission
doesn't block on a synchronous write to a long pipeline.  The extra
replicas aren't necessary for correctness, and a long-running job will get
the locality benefits in the long term once more replicas are created in
the background.
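
To make the trade-off concrete, here is a rough sketch of the two
approaches (my own illustration, not the actual submission code), assuming
an existing FileSystem fs and Path splitFile:

  // (a) Ask for 10 replicas at create time: the client writes through a
  //     pipeline of 10 datanodes, so submission waits on all of them.
  FSDataOutputStream eager = fs.create(splitFile, (short) 10);

  // (b) Create with the default pipeline (typically 3), then raise the
  //     target; the namenode schedules the extra copies in the background.
  FSDataOutputStream lazy = fs.create(splitFile);
  fs.setReplication(splitFile, (short) 10);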

I recommend filing a new JIRA describing the problem you saw.  We can
probably handle this better, and a JIRA would be a good place to discuss
the trade-offs.  A few possibilities:

- Log a warning if mapred.submit.replication < dfs.replication.
- Skip resetting replication if mapred.submit.replication <= dfs.replication.
- Fail with an error if mapred.submit.replication < dfs.replication.
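
As a starting point for that discussion, the second option might look
roughly like the sketch below.  This is just my illustration against the
existing createFile method, with LOG standing in for whatever logger that
class uses:

  int replication = job.getInt("mapred.submit.replication", 10);
  short defaultReplication = fs.getDefaultReplication(splitFile);
  if (replication <= defaultReplication) {
    // Skip the reset: the file already has at least this many replicas,
    // and lowering the target would only trigger replica pruning.
    LOG.warn("Ignoring mapred.submit.replication=" + replication
        + " because it is not greater than dfs.replication="
        + defaultReplication);
  } else {
    fs.setReplication(splitFile, (short) replication);
  }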

Chris Nauroth
Hortonworks
http://hortonworks.com/



On Thu, Aug 15, 2013 at 6:21 AM, Sirianni, Eric <eric.siria...@netapp.com> wrote:

> In debugging some replication issues in our HDFS environment, I noticed
> that the MapReduce framework uses the following algorithm for setting the
> replication on submitted job files:
>
> 1. Create the file with the *default* DFS replication factor (i.e.
> 'dfs.replication')
>
> 2. Subsequently alter the replication of the file based on the
> 'mapred.submit.replication' config value
>
>   private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
>       Configuration job) throws IOException {
>     FSDataOutputStream out = FileSystem.create(fs, splitFile,
>         new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
>     int replication = job.getInt("mapred.submit.replication", 10);
>     fs.setReplication(splitFile, (short) replication);
>     writeSplitHeader(out);
>     return out;
>   }
>
> If I understand correctly, the net functional effect of this approach is
> that
>
> - The initial write pipeline is set up with 'dfs.replication' nodes
> (i.e. 3)
>
> - The namenode triggers additional inter-datanode replications in the
> background (as it detects the blocks as "under-replicated").
>
> I'm assuming this is intentional?  Alternatively, if
> mapred.submit.replication were specified on the initial create, the write
> pipeline would be significantly larger.
>
> The reason I noticed is that we had inadvertently specified
> mapred.submit.replication as *less than* dfs.replication in our
> configuration, which caused a bunch of excess replica pruning (and
> ultimately IOExceptions in our datanode logs).
>
> Thanks,
> Eric
>
>

