[jira] [Updated] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode

2014-09-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2669:
-
Affects Version/s: 1.0.0

 Hadoop configuration is not localised when submitting job in yarn-cluster mode
 --

 Key: SPARK-2669
 URL: https://issues.apache.org/jira/browse/SPARK-2669
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Maxim Ivanov

 I'd like to propose a fix for a problem where the Hadoop configuration is not 
 localized when a job is submitted in yarn-cluster mode. Here is a description 
 from the GitHub pull request https://github.com/apache/spark/pull/1574:
 This patch fixes a problem where, when the Spark driver runs in a container
 managed by the YARN ResourceManager, it inherits its configuration from the
 NodeManager process, which can differ from the Hadoop configuration present
 on the client (the submitting machine). The problem is most apparent when the
 fs.defaultFS property differs between the two.
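 For illustration only (not part of the patch), a bare new Configuration()
 built inside a YARN container resolves fs.defaultFS from the node-local
 *-site.xml files on its classpath, so the value it sees need not match the
 client's:

     import org.apache.hadoop.conf.Configuration

     // Inside an AM or executor container, this loads the NodeManager-side
     // core-site.xml from the classpath, not the client's configuration.
     val nodeSideConf = new Configuration()
     // May print the node's default FS rather than the hdfs://... URI
     // the client submitted with.
     println(nodeSideConf.get("fs.defaultFS"))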
 Hadoop MR solves this by serializing the client's Hadoop configuration into
 job.xml in the application staging directory and then making the Application
 Master use it. That guarantees that, regardless of the execution nodes'
 configurations, all application containers use the same configuration as the
 client.
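 A minimal sketch of that Hadoop MR idea (the staging-directory argument and
 helper name are hypothetical; Configuration.writeXml and FileSystem.create
 are the standard Hadoop APIs):

     import org.apache.hadoop.conf.Configuration
     import org.apache.hadoop.fs.{FileSystem, Path}

     // Serialize the client-side Configuration into job.xml under the app's
     // staging directory so YARN can localize it into every container.
     def writeJobXml(clientConf: Configuration, stagingDir: Path): Path = {
       val fs = FileSystem.get(clientConf)
       val jobXml = new Path(stagingDir, "job.xml")
       val out = fs.create(jobXml)
       try clientConf.writeXml(out) finally out.close()
       jobXml
     }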
 This patch uses a similar approach. The YARN ClientBase serializes the
 configuration and adds it to the ClientDistributedCacheManager under the
 job.xml link name. The ClientDistributedCacheManager then uses the Hadoop
 localizer to deliver it to whatever containers are started by this
 application, including the one running the Spark driver.
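 At the YARN level, registering job.xml under a link name amounts to roughly
 the sketch below (raw LocalResource API shown for illustration; the patch
 goes through ClientDistributedCacheManager, whose exact interface is not
 reproduced here):

     import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
     import org.apache.hadoop.yarn.api.records.{LocalResource, LocalResourceType, LocalResourceVisibility}
     import org.apache.hadoop.yarn.util.{ConverterUtils, Records}
     import scala.collection.mutable

     // Register the staged job.xml so the NodeManager localizer places it in
     // every container's working directory under the "job.xml" link name.
     def addJobXmlResource(
         fs: FileSystem,
         jobXml: Path,
         localResources: mutable.Map[String, LocalResource]): Unit = {
       val status: FileStatus = fs.getFileStatus(jobXml)
       val resource = Records.newRecord(classOf[LocalResource])
       resource.setResource(ConverterUtils.getYarnUrlFromPath(jobXml))
       resource.setSize(status.getLen)
       resource.setTimestamp(status.getModificationTime)
       resource.setType(LocalResourceType.FILE)
       resource.setVisibility(LocalResourceVisibility.APPLICATION)
       localResources("job.xml") = resource
     }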
 The YARN ClientBase also adds the SPARK_LOCAL_HADOOPCONF env variable to the
 AM container request, which is then used by SparkHadoopUtil.newConfiguration
 to trigger the new behavior in which the machine-wide Hadoop configuration is
 merged with the application-specific job.xml (exactly how it is done in
 Hadoop MR). SparkContext then follows the same approach, adding the
 SPARK_LOCAL_HADOOPCONF env variable to all spawned containers so that they
 use the client-side Hadoop configuration.
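 A minimal sketch of that merge step, assuming the SPARK_LOCAL_HADOOPCONF
 variable and the "job.xml" link name from above (not the exact
 SparkHadoopUtil.newConfiguration code):

     import org.apache.hadoop.conf.Configuration
     import org.apache.hadoop.fs.Path

     def newConfiguration(): Configuration = {
       // Machine-wide *-site.xml files from the container's classpath.
       val conf = new Configuration()
       if (sys.env.contains("SPARK_LOCAL_HADOOPCONF")) {
         // job.xml was localized into the container's working directory;
         // addResource overlays the client-side values on the node-local defaults.
         conf.addResource(new Path("job.xml"))
       }
       conf
     }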
 Also, all references to new Configuration() that might be executed on the
 YARN cluster side are changed to use SparkHadoopUtil.get.conf.
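 In practice that change is a one-liner wherever a cluster-side Configuration
 is built, e.g.:

     import org.apache.spark.deploy.SparkHadoopUtil

     // Before: val hadoopConf = new org.apache.hadoop.conf.Configuration()
     // After: go through SparkHadoopUtil so the localized job.xml is picked
     // up when present.
     val hadoopConf = SparkHadoopUtil.get.conf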
 Please note that this fixes only core Spark, the part I am comfortable
 testing and verifying. I didn't descend into the streaming/shark directories,
 so things might need to be changed there too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode

2014-09-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2669:
-
Component/s: YARN
