[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-04-02 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88908858
  
You have to package or send all the appropriate files with your Spark jar, 
for instance the Hadoop configs. How were you running it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-04-01 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88507012
  
@Sephiroth-Lin What testing have you done with this?





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-04-01 Thread Sephiroth-Lin
Github user Sephiroth-Lin commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88708773
  
@tgravescs @srowen @sryza I have retested: if we don't populate the Hadoop 
classpath, it doesn't work in any case. This PR can't solve this issue, so I 
will close it later. Thank you.





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-04-01 Thread Sephiroth-Lin
Github user Sephiroth-Lin closed the pull request at:

https://github.com/apache/spark/pull/5294





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-04-01 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88590039
  
Jenkins, test this please





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread Sephiroth-Lin
GitHub user Sephiroth-Lin opened a pull request:

https://github.com/apache/spark/pull/5294

[SPARK-1502][YARN]Add config option to not include yarn/mapred cluster 
classpath



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Sephiroth-Lin/spark SPARK-1502

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5294.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5294


commit 96aa689b8b65ce73e13e4f48a49b85a5f8ed751a
Author: unknown l00251...@hghy1l002515991.china.huawei.com
Date:   2015-03-31T11:31:13Z

Add config option to not include yarn/mapred cluster classpath







[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88054697
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88054581
  
CC @vanzin. The cluster's assembly would generally have Hadoop provided, 
right? So you would want the cluster's classes.





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88165688
  
@srowen I assume this would be for running assemblies that already include 
the Hadoop classes.  @Sephiroth-Lin do you mind going into detail about the 
situations you need this in?





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88171273
  
Yeah, unless there's an actual use case for this, it doesn't sound like we 
need the change. The classpath is added after Spark's assembly, so if the 
assembly includes the Hadoop/YARN classes, it will override the cluster ones.
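
The ordering argument above can be sketched with a toy model. This is only an illustration of first-match-wins lookup on an ordered classpath, not Spark's or the JVM's actual class-loading code; `ClasspathEntry`, `resolve`, and the entry and class names are hypothetical.

```scala
// Toy model of first-match-wins class lookup on an ordered classpath.
// ClasspathEntry, resolve, and all entry names are hypothetical.
final case class ClasspathEntry(name: String, classes: Set[String])

object ClasspathOrderSketch {
  // Return the name of the first entry that provides the class, if any.
  def resolve(classpath: Seq[ClasspathEntry], className: String): Option[String] =
    classpath.find(_.classes.contains(className)).map(_.name)

  val sparkAssembly: ClasspathEntry = ClasspathEntry(
    "spark-assembly",
    Set("org.apache.spark.SparkContext", "org.apache.hadoop.conf.Configuration"))

  val clusterHadoop: ClasspathEntry = ClasspathEntry(
    "cluster-hadoop",
    Set("org.apache.hadoop.conf.Configuration", "org.example.ClusterOnlyClass"))

  // The assembly is listed first; the cluster classpath is appended after it.
  val classpath: Seq[ClasspathEntry] = Seq(sparkAssembly, clusterHadoop)

  def main(args: Array[String]): Unit = {
    println(resolve(classpath, "org.apache.hadoop.conf.Configuration"))
    println(resolve(classpath, "org.example.ClusterOnlyClass"))
  }
}
```

In this model a class shipped in both places resolves from the assembly, while a class that exists only on the cluster side is still picked up from the cluster entry, which is the kind of leakage discussed elsewhere in this thread.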





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88210123
  
@tgravescs in that case you're running Spark with a slightly different 
version of Hadoop classes than is found on the local machine or on the rest of 
the cluster. I can imagine that being the right thing to do in the odd rare 
case. I am wondering if it's something that at this stage is worth formally 
supporting?





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88174985
  
/cc @tgravescs





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88183017
  
So we had run into an issue where something in Hadoop changed that required 
me to recompile Spark, whereas if it hadn't included the stuff from Hadoop in 
my classpath it should have worked, because the Spark assembly jar already 
included Hadoop. Looks like I unfortunately didn't put the details in the 
JIRA, but I filed this so that you could potentially package Spark and any 
confs you needed and be completely independent of what is on the cluster. This 
would allow minor incompatibilities/changes between what's on the Hadoop cluster 
and the Hadoop version Spark was compiled with.

Thinking about it now, I think it had to do with a conf that changed in Hadoop 
and required a new class that wasn't in the Spark assembly. If we hadn't 
included the Hadoop/YARN stuff in the classpath it would have worked, as it 
wouldn't have picked up said conf. This is basically the opposite packaging 
mechanism from the hadoop-provided option.





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88239503
  
Yes, it should be an odd case, but if you are using it in production and it 
suddenly breaks while Hadoop is doing a rolling upgrade then it could be a 
major issue. I haven't actually had time to work on this yet, but my plan was 
to package things separately to prevent this from happening, so I would need 
this JIRA for that.

When I filed the JIRA I didn't expect this to be controversial. I haven't 
tested the patch, but it's basically one config and one if statement. Why is 
this such a big deal?





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88254374
  
Yes, that is basically the scenario. Although I would expect it to start out 
with hadoopA packaged with Spark, running on hadoopA; then hadoopB is deployed 
and Spark with hadoopA runs just fine on hadoopB.

This allows for separate deployments of Hadoop and Spark. Otherwise you 
have to make sure Spark and Hadoop get deployed everywhere at the same time and 
everyone upgrades to the new version of Spark.

Yes, it did happen, which is what led me to file this JIRA and plan on 
changing how we internally package Spark. I don't think it will happen very 
often, but I also don't want this to cause an issue on a production system. 
MapReduce has this same issue and we actually package that fully separately to 
prevent this. With Hadoop now supporting rolling upgrades this is more of a 
concern.

Personally I see things moving toward more isolated environments, where we 
aren't requiring Hadoop and its dependencies to be included in everything that 
runs on YARN. Many users have issues with dependencies and such, and having 
this config should at least give them the option.





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88254838
  
Note, I do understand what you are saying: if there isn't really a use case 
we shouldn't include it, as it has a dev cost. If everyone else disagrees 
with my use case, that's fine.





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88255263
  
Yeah, that's the idea behind deploying a Spark that doesn't include Hadoop. 
In your scenario, if Spark totally works with Hadoop A and B, then 
Spark-without-Hadoop should work with both. I assume the scenario is that Spark 
on B doesn't work, but then does Spark+A on B work? OK. Well, yeah, maybe it's 
worth asking what others think, to get another data point on whether this is 
something affecting a critical mass of use cases.





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88259688
  
So yes, I could use hadoop-provided and then package my own Hadoop, but you 
end up with the same scenario as I describe. If I don't package Hadoop then I 
rely on the version on the cluster, and at any time they can deploy a new 
Hadoop version that breaks Spark. Note we've had issues with Hadoop breaking 
APIs before.

This really shouldn't happen very often, but the question comes down to the 
risk. If I'm running a production pipeline that is revenue-bearing, do I 
want to potentially lose $$$, or should I isolate things, package it together, 
and minimize my risk? I'm leaning towards doing the latter.







[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88247299
  
I don't know that it's controversial. As in all things, it's a question of 
how much of a problem it solves for how many users versus how much burden it 
puts on other users or current and future maintainers. I agree there's not a 
lot of complexity here besides yet another config parameter (albeit, OK, 
undocumented), so I was asking about how much problem it solves and when.

So, you package Hadoop A with Spark, which is compatible-enough with Hadoop 
B deployed on your cluster that you can run Spark jobs using Hadoop A on this 
cluster. But this is to defend against Hadoop C being deployed under you, which 
can't coexist with your Spark, but this Spark + Hadoop A combo still executes 
correctly on the Hadoop C cluster? Is that something that realistically happens?





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/5294#issuecomment-88271630
  
So I was mostly interested in understanding what the use case was, since 
the bug was a little short on details. Tom's explanation makes sense; the 
opposite (hadoopA built into Spark assembly breaking when it's run on the 
cluster's hadoopB) already has workarounds since Spark gives user control of 
the app's classpath in different ways.

Given that, the patch looks good; it should probably remain an undocumented 
option, though.





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/5294#discussion_r27530621
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -809,7 +809,13 @@ object Client extends Logging {
   }
 }
 addFileToClasspath(new URI(sparkJar(sparkConf)), SPARK_JAR, env)
-populateHadoopClasspath(conf, env)
+// Since we have a spark assembly that is including all the yarn and other dependencies we need,
--- End diff --

"Since" makes it seem like we know this is the case. I'd say "Because the 
Spark assembly may already include Hadoop and its dependencies..."





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/5294#discussion_r27530532
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -809,7 +809,13 @@ object Client extends Logging {
   }
 }
 addFileToClasspath(new URI(sparkJar(sparkConf)), SPARK_JAR, env)
-populateHadoopClasspath(conf, env)
+// Since we have a spark assembly that is including all the yarn and other dependencies we need,
+// add an option to allow the user to not include the cluster default yarn/mapreduce application
+// classpaths when running spark on yarn.
+val isPopulateHadoopClasspath = conf.getBoolean("spark.yarn.cluster.classpath.populate", true)
--- End diff --

This name seems a little weird to me.  In particular, periods should be 
used to separate config namespaces, not in places where we'd use spaces in 
English.  I'd go with something like `spark.yarn.includeClusterHadoopJars` or 
`spark.yarn.includeClusterHadoopClasspath`.
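
To make the naming point concrete, here is a small sketch that splits the key from the diff and one of the suggested keys into dot-separated segments. `ConfigKeySketch` and `segments` are hypothetical helpers written for this illustration; nothing here is Spark API.

```scala
// Hypothetical helper showing how many namespace levels each key implies.
object ConfigKeySketch {
  // Split a config key on '.' into its namespace segments.
  def segments(key: String): Seq[String] = key.split('.').toSeq

  // Key as written in the diff under review.
  val original: String = "spark.yarn.cluster.classpath.populate"
  // One of the names suggested in the review comment.
  val suggested: String = "spark.yarn.includeClusterHadoopClasspath"

  def main(args: Array[String]): Unit = {
    println(segments(original).mkString(" / "))
    println(segments(suggested).mkString(" / "))
  }
}
```

The first key reads as five nested namespaces, although `cluster`, `classpath`, and `populate` are really one phrase; the suggested key keeps `spark.yarn` as the namespace and puts the phrase into a single camelCase segment.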





[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...

2015-03-31 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/5294#discussion_r27530361
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -809,7 +809,13 @@ object Client extends Logging {
   }
 }
 addFileToClasspath(new URI(sparkJar(sparkConf)), SPARK_JAR, env)
-populateHadoopClasspath(conf, env)
+// Since we have a spark assembly that is including all the yarn and other dependencies we need,
+// add an option to allow the user to not include the cluster default yarn/mapreduce application
+// classpaths when running spark on yarn.
+val isPopulateHadoopClasspath = conf.getBoolean("spark.yarn.cluster.classpath.populate", true)
+if (isPopulateHadoopClasspath) {
--- End diff --

The `conf.getBoolean` can just go in the if statement
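
A minimal sketch of what the inlined form might look like, assuming a stand-in configuration class rather than the real one used in `Client.scala`. `SimpleConf`, `buildClasspath`, the jar names, and the key (borrowed from a name suggested in review) are all hypothetical.

```scala
// Sketch of gating the cluster classpath on a boolean option, with the
// getBoolean call placed directly in the if condition.
object PopulateGateSketch {
  // Hypothetical key; the option name was still under discussion in review.
  val Key: String = "spark.yarn.includeClusterHadoopClasspath"

  // Stand-in config that only models getBoolean with a default value.
  final class SimpleConf(settings: Map[String, String]) {
    def getBoolean(key: String, default: Boolean): Boolean =
      settings.get(key).map(_.trim.toBoolean).getOrElse(default)
  }

  // The assembly jar always comes first; cluster entries follow unless disabled.
  def buildClasspath(conf: SimpleConf, clusterEntries: Seq[String]): Seq[String] = {
    val base = Seq("spark-assembly.jar") // placeholder jar name
    if (conf.getBoolean(Key, true)) base ++ clusterEntries else base
  }

  def main(args: Array[String]): Unit = {
    val cluster = Seq("hadoop-common.jar", "hadoop-yarn-api.jar")
    println(buildClasspath(new SimpleConf(Map.empty), cluster))
    println(buildClasspath(new SimpleConf(Map(Key -> "false")), cluster))
  }
}
```

Defaulting to `true` keeps the existing behaviour for anyone who does not set the option, which matches the default in the diff.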

