[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Fang updated MAPREDUCE-5278:
-------------------------------

    Description: 
Today, we set the JobTracker staging dir 
("mapreduce.jobtracker.staging.root.dir") to point to HDFS even though ASV is 
the default file system. There are a few reasons why this config was chosen:

1. To prevent leaking the storage account credentials into the user's storage 
account (IOW, keep job.xml in the cluster). 
2. It uses HDFS for the transient job files, which is good for two reasons: a) 
it does not flood the user's storage account with irrelevant data/files; b) it 
leverages HDFS locality for small files.

However, this approach conflicts with how distributed cache caching works, 
completely negating the feature's functionality.

When files are added to the distributed cache (through the 
files/archives/libjars hadoop generic options), they are copied to the job 
tracker staging dir only if they reside on a file system different from the 
jobtracker's. Later on, this path is used as a "key" to cache the files 
locally on the tasktracker's machine and avoid localization (download/unzip) 
of the distributed cache files if they are already localized.

In our configuration, the caching is completely defeated: we always end up 
copying dist cache files to the JT staging dir first and localizing them on 
the tasktracker machine second.
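The cache-miss behavior described above can be illustrated with a simplified 
sketch (assumed logic, not the actual Hadoop source; the class and method 
names here are hypothetical):

```java
import java.net.URI;

/**
 * Simplified sketch of why a staging dir on a non-default FS defeats
 * dist-cache caching: files on a different FS are copied under a fresh
 * per-job staging path, so the cache "key" changes on every submission.
 */
public class DistCacheKeySketch {

    /** Returns the path that ends up being used as the localization cache key. */
    static URI cacheKey(URI cacheFile, URI stagingDir, String jobId) {
        // Files already on the staging dir's file system keep their stable path,
        // so the TaskTracker can reuse an earlier localized copy.
        if (sameFileSystem(cacheFile, stagingDir)) {
            return cacheFile;
        }
        // Otherwise the file is copied under the per-job staging dir; the key
        // differs for every job, which guarantees a cache miss.
        return stagingDir.resolve(jobId + "/" + fileName(cacheFile));
    }

    // A file system is identified here by URI scheme + authority.
    static boolean sameFileSystem(URI a, URI b) {
        return String.valueOf(a.getScheme()).equals(String.valueOf(b.getScheme()))
            && String.valueOf(a.getAuthority()).equals(String.valueOf(b.getAuthority()));
    }

    static String fileName(URI u) {
        String p = u.getPath();
        return p.substring(p.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        URI jar = URI.create("asv://account/container/libs/hive-exec.jar");

        // Staging on HDFS while the jar is on ASV: the key varies per job.
        URI hdfsStaging = URI.create("hdfs://nn:8020/staging/");
        System.out.println(cacheKey(jar, hdfsStaging, "job_001"));
        System.out.println(cacheKey(jar, hdfsStaging, "job_002"));

        // Staging on the default FS (ASV): the key is the stable original path.
        URI asvStaging = URI.create("asv://account/container/staging/");
        System.out.println(cacheKey(jar, asvStaging, "job_003"));
    }
}
```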

This is especially bad for Oozie scenarios, as Oozie uses the dist cache to 
distribute Hive/Pig jars throughout the cluster.

An easy workaround is to configure mapreduce.jobtracker.staging.root.dir in 
mapred-site.xml to be on the default FS.
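For reference, the workaround would look roughly like this in mapred-site.xml 
(a sketch; the staging path shown is illustrative, any path on the default ASV 
file system should do):

```xml
<!-- mapred-site.xml: keep the JT staging dir on the default FS so dist cache
     file paths stay stable across jobs and localization caching works. -->
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <!-- A scheme-less path resolves against the default file system (ASV here),
       rather than forcing HDFS with an explicit hdfs:// URI. -->
  <value>/mapred/staging</value>
</property>
```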

  was:
Today, we set the JobTracker staging dir 
("mapreduce.jobtracker.staging.root.dir") to point to HDFS even though ASV is 
the default file system. There are a few reasons why this config was chosen:
To prevent leaking the storage account credentials into the user's storage 
account (IOW, keep job.xml in the cluster). This is needed until HADOOP-444 is 
fixed.
It uses HDFS for the transient job files, which is good for two reasons: a) it 
does not flood the user's storage account with irrelevant data/files; b) it 
leverages HDFS locality for small files.
However, this approach conflicts with how distributed cache caching works, 
completely negating the feature's functionality.
When files are added to the distributed cache (through the 
files/archives/libjars hadoop generic options), they are copied to the job 
tracker staging dir only if they reside on a file system different from the 
jobtracker's. Later on, this path is used as a "key" to cache the files 
locally on the tasktracker's machine and avoid localization (download/unzip) 
of the distributed cache files if they are already localized.
In our configuration, the caching is completely defeated: we always end up 
copying dist cache files to the JT staging dir first and localizing them on 
the tasktracker machine second.
This is especially bad for Oozie scenarios, as Oozie uses the dist cache to 
distribute Hive/Pig jars throughout the cluster.
An easy workaround is to configure mapreduce.jobtracker.staging.root.dir in 
mapred-site.xml to be on the default FS.

    
> Perf: Distributed cache is broken when JT staging dir is not on the default FS
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5278
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5278
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1-win
>         Environment: Windows
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>
> Today, we set the JobTracker staging dir 
> ("mapreduce.jobtracker.staging.root.dir") to point to HDFS even though ASV 
> is the default file system. There are a few reasons why this config was 
> chosen:
> 1. To prevent leaking the storage account credentials into the user's 
> storage account (IOW, keep job.xml in the cluster). 
> 2. It uses HDFS for the transient job files, which is good for two reasons: 
> a) it does not flood the user's storage account with irrelevant data/files; 
> b) it leverages HDFS locality for small files.
> However, this approach conflicts with how distributed cache caching works, 
> completely negating the feature's functionality.
> When files are added to the distributed cache (through the 
> files/archives/libjars hadoop generic options), they are copied to the job 
> tracker staging dir only if they reside on a file system different from the 
> jobtracker's. Later on, this path is used as a "key" to cache the files 
> locally on the tasktracker's machine and avoid localization (download/unzip) 
> of the distributed cache files if they are already localized.
> In our configuration, the caching is completely defeated: we always end up 
> copying dist cache files to the JT staging dir first and localizing them on 
> the tasktracker machine second.
> This is especially bad for Oozie scenarios, as Oozie uses the dist cache to 
> distribute Hive/Pig jars throughout the cluster.
> An easy workaround is to configure mapreduce.jobtracker.staging.root.dir in 
> mapred-site.xml to be on the default FS.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
