[GitHub] spark pull request: Localise hadoop configuration when submitting ...

2014-07-24 Thread redbaron
GitHub user redbaron opened a pull request:

https://github.com/apache/spark/pull/1574

Localise hadoop configuration when submitting in yarn-cluster mode

This patch fixes a problem where a Spark driver running in a container
managed by the YARN ResourceManager inherits its configuration from the
NodeManager process, which can differ from the Hadoop configuration
present on the client (submitting machine). The problem is most visible
when the fs.defaultFS property differs between the two.
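
For concreteness, the mismatch can be observed with a snippet like the
following (the URI values are hypothetical):

    import org.apache.hadoop.conf.Configuration

    // new Configuration() reads whatever core-site.xml is on the local
    // classpath, so the same call answers differently per machine.
    val conf = new Configuration()
    // On the client this might print hdfs://client-nn:8020; inside a
    // YARN container it prints whatever the NodeManager's config says.
    println(conf.get("fs.defaultFS"))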

Hadoop MR solves this by serializing the client's Hadoop configuration
into job.xml in the application staging directory and then making the
ApplicationMaster use it. That guarantees that, regardless of the
execution nodes' configurations, all applications use the same config.
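
A minimal sketch of that serialization step, assuming a client-side
Configuration and a staging directory path (writeJobXml is an
illustrative name, not the actual MR code):

    import java.io.OutputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: dump every key/value of the client configuration into
    // job.xml inside the application staging directory.
    def writeJobXml(conf: Configuration, stagingDir: Path): Path = {
      val jobXml = new Path(stagingDir, "job.xml")
      val out: OutputStream = FileSystem.get(conf).create(jobXml)
      try conf.writeXml(out) finally out.close()
      jobXml
    }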

This patch uses a similar approach. The YARN ClientBase serializes the
configuration and adds it to the ClientDistributedCacheManager under the
job.xml link name. ClientDistributedCacheManager then uses the Hadoop
localizer to deliver it to every container started by this application,
including the one running the Spark driver.
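
For readers unfamiliar with the localizer, the registration roughly
corresponds to building a YARN LocalResource entry as in the sketch
below (the real ClientDistributedCacheManager wiring differs in detail):

    import org.apache.hadoop.fs.{FileStatus, Path}
    import org.apache.hadoop.yarn.api.records.{LocalResource,
      LocalResourceType, LocalResourceVisibility}
    import org.apache.hadoop.yarn.util.{ConverterUtils, Records}

    // Sketch: describe job.xml so the NodeManager localizer copies it
    // into every container's working directory under that link name.
    def jobXmlResource(status: FileStatus, jobXml: Path): LocalResource = {
      val res = Records.newRecord(classOf[LocalResource])
      res.setResource(ConverterUtils.getYarnUrlFromPath(jobXml))
      res.setSize(status.getLen)
      res.setTimestamp(status.getModificationTime)
      res.setType(LocalResourceType.FILE)
      res.setVisibility(LocalResourceVisibility.APPLICATION)
      res
    }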

YARN ClientBase also adds a SPARK_LOCAL_HADOOPCONF env variable to the
AM container request, which is then used by
SparkHadoopUtil.newConfiguration to trigger the new behavior in which
the machine-wide Hadoop configuration is merged with the
application-specific job.xml (exactly as is done in Hadoop MR).
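
The merge itself can be pictured as follows; this is a simplified
sketch of what SparkHadoopUtil.newConfiguration might do under the
flag, not the patch verbatim:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    def newConfiguration(): Configuration = {
      // Start from the machine-wide Hadoop configuration.
      val conf = new Configuration()
      // If the client requested it, overlay the localized job.xml,
      // which sits in the container's working directory.
      if (System.getenv("SPARK_LOCAL_HADOOPCONF") != null) {
        conf.addResource(new Path("job.xml"))
      }
      conf
    }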

SparkContext then follows the same approach, adding the
SPARK_LOCAL_HADOOPCONF env variable to all spawned containers to make
them use the client-side Hadoop configuration.
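
Propagating the flag to an executor container amounts to a line or two
on the launch context; a hedged sketch using the stock YARN API:

    import scala.collection.JavaConverters._
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext
    import org.apache.hadoop.yarn.util.Records

    // Sketch: set the env var on each container so that
    // SparkHadoopUtil.newConfiguration applies the job.xml overlay there.
    val ctx = Records.newRecord(classOf[ContainerLaunchContext])
    ctx.setEnvironment(Map("SPARK_LOCAL_HADOOPCONF" -> "true").asJava)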

Also, all references to new Configuration() that might be executed on
the YARN cluster side are changed to use SparkHadoopUtil.get.conf.
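
In code, the substitution looks like this:

    import org.apache.spark.deploy.SparkHadoopUtil

    // Before: new Configuration() sees only the node-local Hadoop
    // config. After: the shared accessor applies the job.xml overlay
    // when SPARK_LOCAL_HADOOPCONF is set.
    val hadoopConf = SparkHadoopUtil.get.conf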

Please note that this fixes only core Spark, the part which I am
comfortable testing and verifying. I didn't descend into the
streaming/shark directories, so things might need to be changed there
too.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/redbaron/spark transfer_hadoopconf

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1574.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1574


commit 29bd724878cf9522578b9677bba26bc9dc593d3e
Author: Maxim Ivanov ivanov.ma...@gmail.com
Date:   2014-07-24T13:38:08Z

Localise hadoop configuration when submitting in yarn-cluster mode



[GitHub] spark pull request: Localise hadoop configuration when submitting ...

2014-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1574#issuecomment-50022339
  
Can one of the admins verify this patch?




[GitHub] spark pull request: Localise hadoop configuration when submitting ...

2014-07-24 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1574#issuecomment-50024391
  
This is a good idea.




[GitHub] spark pull request: Localise hadoop configuration when submitting ...

2014-07-24 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1574#issuecomment-50027254
  
Please file a JIRA and put the description in there as well. It would
be nice to have a way to specify the Hadoop configs.

