[ https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073338#comment-14073338 ]
Apache Spark commented on SPARK-2669: ------------------------------------- User 'redbaron' has created a pull request for this issue: https://github.com/apache/spark/pull/1574 > Hadoop configuration is not localised when submitting job in yarn-cluster mode > ------------------------------------------------------------------------------ > > Key: SPARK-2669 > URL: https://issues.apache.org/jira/browse/SPARK-2669 > Project: Spark > Issue Type: Bug > Reporter: Maxim Ivanov > > I'd like to propose a fix for a problem when Hadoop configuration is not > localized when job is submitted in yarn-cluster mode. Here is a description > from github pull request https://github.com/apache/spark/pull/1574 > This patch fixes a problem when Spark driver is run in the container > managed by YARN ResourceManager it inherits configuration from a > NodeManager process, which can be different from the Hadoop > configuration present on the client (submitting machine). Problem is > most vivid when fs.defaultFS property differs between these two. > Hadoop MR solves it by serializing client's Hadoop configuration into > job.xml in application staging directory and then making Application > Master to use it. That guarantees that regardless of execution nodes > configurations all application containers use same config identical to > one on the client side. > This patch uses similar approach. YARN ClientBase serializes > configuration and adds it to ClientDistributedCacheManager under > "job.xml" link name. ClientDistributedCacheManager is then utilizes > Hadoop localizer to deliver it to whatever container is started by this > application, including the one running Spark driver. > YARN ClientBase also adds "SPARK_LOCAL_HADOOPCONF" env variable to AM > container request which is then used by SparkHadoopUtil.newConfiguration > to trigger new behavior when machine-wide hadoop configuration is merged > with application specific job.xml (exactly how it is done in Hadoop MR). > SparkContext is then follows same approach, adding > SPARK_LOCAL_HADOOPCONF env to all spawned containers to make them use > client-side Hadopo configuration. > Also all the references to "new Configuration()" which might be executed > on YARN cluster side are changed to use SparkHadoopUtil.get.conf > Please note that it fixes only core Spark, the part which I am > comfortable to test and verify the result. I didn't descend into > steaming/shark directories, so things might need to be changed there too. -- This message was sent by Atlassian JIRA (v6.2#6252)