[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88908858 You have to package or send all the appropriate stuff with your Spark jar, for instance the Hadoop configs. How were you running it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88507012 @Sephiroth-Lin what testing have you done with this?
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88708773 @tgravescs @srowen @sryza I have retested, and if we don't populate the Hadoop classpath it doesn't work in any case. This PR can't solve this issue, so I will close it later. Thank you.
Github user Sephiroth-Lin closed the pull request at: https://github.com/apache/spark/pull/5294
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88590039 Jenkins, test this please
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5294

[SPARK-1502][YARN] Add config option to not include yarn/mapred cluster classpath

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sephiroth-Lin/spark SPARK-1502

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5294.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5294

commit 96aa689b8b65ce73e13e4f48a49b85a5f8ed751a
Author: unknown l00251...@hghy1l002515991.china.huawei.com
Date: 2015-03-31T11:31:13Z

    Add config option to not include yarn/mapred cluster classpath
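The behavior the patch proposes can be sketched as below. The config key mirrors the one in the patch, but the surrounding code is a self-contained stand-in, not Spark's actual Client.scala; `buildClasspath` and the Map/Buffer types are illustrative.

```scala
// Illustrative sketch of the gate the patch proposes around
// populateHadoopClasspath. The Map stands in for SparkConf and the
// Buffer for the container launch environment; the real change is in
// yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala.
object ClasspathSketch {
  // Stand-in for SparkConf.getBoolean.
  private def getBoolean(conf: Map[String, String], key: String, default: Boolean): Boolean =
    conf.get(key).map(_.toBoolean).getOrElse(default)

  // Stand-in for Client.populateHadoopClasspath, which appends the
  // cluster's yarn.application.classpath / mapreduce.application.classpath.
  private def populateHadoopClasspath(env: scala.collection.mutable.Buffer[String]): Unit =
    env += "<cluster yarn/mapreduce classpath>"

  def buildClasspath(conf: Map[String, String]): Seq[String] = {
    val env = scala.collection.mutable.Buffer("spark-assembly.jar")
    // The proposed option defaults to true, so existing deployments see no change;
    // setting it to false leaves only what the user packaged with Spark.
    if (getBoolean(conf, "spark.yarn.cluster.classpath.populate", default = true)) {
      populateHadoopClasspath(env)
    }
    env.toSeq
  }
}
```

Setting the option to false is what lets a Hadoop-including assembly run independently of the jars installed on the cluster nodes.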
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88054697 Can one of the admins verify this patch?
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88054581 CC @vanzin. The cluster's assembly would generally have Hadoop provided, right? So you would want the cluster's classes.
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88165688 @srowen I assume this would be for running assemblies that already include the Hadoop classes. @Sephiroth-Lin do you mind going into detail about the situations you need this in?
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88171273 Yeah, unless there's an actual use case for this, it doesn't sound like we need the change. The classpath is added after Spark's assembly, so if the assembly includes the Hadoop/YARN classes, it will override the cluster ones.
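The ordering point above can be illustrated with a toy first-match-wins lookup; this mimics how a JVM application classloader resolves a class from an ordered classpath. The code is purely illustrative (the jar and class names are made up), not anything from Spark.

```scala
// Toy model of classpath resolution: the first entry that provides a
// class wins, so entries added earlier shadow later ones.
object ClasspathOrder {
  // Each pair is (jar name, set of classes that jar provides), illustrative only.
  def resolve(classpath: Seq[(String, Set[String])], className: String): Option[String] =
    classpath.collectFirst { case (jar, classes) if classes(className) => jar }
}
```

With the Spark assembly listed before the cluster's Hadoop jars, an assembly-bundled class shadows the cluster's copy, while classes only the cluster provides are still found.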
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88210123 @tgravescs in that case you're running Spark with a slightly different version of the Hadoop classes than is found on the local machine or on the rest of the cluster. I can imagine that being the right thing to do in the odd rare case. I am wondering whether it's something worth formally supporting at this stage.
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88174985 /cc @tgravescs
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88183017 So we had run into an issue where something in Hadoop changed that required me to recompile Spark, whereas if the Hadoop stuff hadn't been included in my classpath it should have worked, because the Spark assembly jar already included Hadoop. Unfortunately I didn't put the details in the JIRA, but I filed this so that you could potentially package Spark and any confs you need and be completely independent of what is on the cluster. This would allow minor incompatibilities/changes between what's on the Hadoop cluster and the Hadoop version Spark was compiled with. Thinking back, I believe it had to do with a conf change in Hadoop that required a new class which wasn't in the Spark assembly. If we hadn't included the hadoop/yarn stuff in the classpath it would have worked, as it wouldn't have picked up said conf. This is basically the opposite packaging mechanism to the hadoop-provided option.
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88239503 Yes, it should be an odd case, but if you are using it in production and it suddenly breaks while Hadoop is doing a rolling upgrade then it could be a major issue. I haven't actually had time to work on this yet, but my plan was to package things separately to prevent this from happening, so I would need this JIRA for that. When I filed the JIRA I didn't expect this to be controversial. I haven't tested the patch, but it's basically one config and one if statement; why is this such a big deal?
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88254374 Yes, that is basically the scenario, although I would expect it to start out as packaging hadoopA with Spark running on hadoopA; then hadoopB is deployed, and Spark with hadoopA runs just fine on hadoopB. This allows for separate deployments of Hadoop and Spark. Otherwise you have to make sure Spark and Hadoop get deployed everywhere at the same time and everyone upgrades to the new version of Spark. Yes, it did happen, which is what led me to file this JIRA and plan on changing how we internally package Spark. I don't think it will happen very often, but I also don't want this to cause an issue on a production system. MapReduce has this same issue and we actually package that fully separately to prevent it. With Hadoop now supporting rolling upgrades this is more of a concern. Personally I see things trying to go to more isolated environments where we aren't making Hadoop and its dependencies be included in everything that runs on YARN. Many users have issues with dependencies, and having this config should at least give them the option.
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88254838 Note, I do understand what you are saying: if there isn't really a use case, we shouldn't include it, as it has a development cost. If everyone else disagrees with my use case, that's fine.
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88255263 Yeah, that's the idea behind deploying a Spark that doesn't include Hadoop. In your scenario, if Spark totally works with Hadoop A and B, then Spark-without-Hadoop should work with both. I assume the scenario is that Spark on B doesn't work, but then Spark+A on B does work? OK. Well, maybe it's worth asking what others think, to get another data point on whether this is something affecting a critical mass of use cases.
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88259688 So yes, I could use hadoop-provided and then package my own Hadoop, but you end up with the same scenario as I describe. If I don't package Hadoop then I rely on the version on the cluster, and at any time they can deploy a new Hadoop version that breaks Spark. Note we've had issues with Hadoop breaking APIs before. This really shouldn't happen very often, but the question comes down to risk. If I'm running a production pipeline that is revenue-bearing, do I want to potentially lose $$$, or should I isolate things, package them together, and minimize my risk? I'm leaning towards the latter.
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88247299 I don't know that it's controversial. As in all things, it's a question of how much of a problem it solves for how many users versus how much burden it puts on other users or current and future maintainers. I agree there's not a lot of complexity here besides yet another config parameter (albeit, OK, undocumented), so I was asking about how much problem it solves and when. So, you package Hadoop A with Spark, which is compatible-enough with Hadoop B deployed on your cluster that you can run Spark jobs using Hadoop A on this cluster. But this is to defend against Hadoop C being deployed under you, which can't coexist with your Spark, but this Spark + Hadoop A combo still executes correctly on the Hadoop C cluster? Is that something that realistically happens?
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88271630 So I was mostly interested in understanding what the use case was, since the bug was a little short on details. Tom's explanation makes sense; the opposite (hadoopA built into the Spark assembly breaking when it's run on the cluster's hadoopB) already has workarounds, since Spark gives the user control of the app's classpath in different ways. Given that, the patch looks good; it probably should remain an undocumented option, though.
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/5294#discussion_r27530621

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -809,7 +809,13 @@ object Client extends Logging {
       }
     }
     addFileToClasspath(new URI(sparkJar(sparkConf)), SPARK_JAR, env)
-    populateHadoopClasspath(conf, env)
+    // Since we have a spark assembly that is including all the yarn and other dependencies we need,
--- End diff --

"Since" makes it seem like we know this is the case. I'd say "Because the Spark assembly may already include Hadoop and its dependencies..."
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/5294#discussion_r27530532

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -809,7 +809,13 @@ object Client extends Logging {
       }
     }
     addFileToClasspath(new URI(sparkJar(sparkConf)), SPARK_JAR, env)
-    populateHadoopClasspath(conf, env)
+    // Since we have a spark assembly that is including all the yarn and other dependencies we need,
+    // add an option to allow the user to not include the cluster default yarn/mapreduce application
+    // classpaths when running spark on yarn.
+    val isPopulateHadoopClasspath = conf.getBoolean("spark.yarn.cluster.classpath.populate", true)
--- End diff --

This name seems a little weird to me. In particular, periods should be used to separate config namespaces, not in places where we'd use spaces in English. I'd go with something like `spark.yarn.includeClusterHadoopJars` or `spark.yarn.includeClusterHadoopClasspath`.
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/5294#discussion_r27530361

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -809,7 +809,13 @@ object Client extends Logging {
       }
     }
     addFileToClasspath(new URI(sparkJar(sparkConf)), SPARK_JAR, env)
-    populateHadoopClasspath(conf, env)
+    // Since we have a spark assembly that is including all the yarn and other dependencies we need,
+    // add an option to allow the user to not include the cluster default yarn/mapreduce application
+    // classpaths when running spark on yarn.
+    val isPopulateHadoopClasspath = conf.getBoolean("spark.yarn.cluster.classpath.populate", true)
+    if (isPopulateHadoopClasspath) {
--- End diff --

The `conf.getBoolean` can just go in the if statement.
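Folding both review suggestions together, the condition would read roughly as below. Note that `spark.yarn.includeClusterHadoopClasspath` is sryza's proposed name, not a config key Spark actually shipped, and the object is a self-contained stand-in rather than Spark's Client.scala.

```scala
// Sketch of the reviewed shape: the getBoolean call moves directly into
// the if condition, and the key uses a camelCase leaf name under the
// spark.yarn namespace (sryza's suggested, hypothetical name).
object ReviewedSketch {
  // Stand-in for: if (conf.getBoolean("spark.yarn.includeClusterHadoopClasspath", true)) {
  //                 populateHadoopClasspath(conf, env)
  //               }
  def shouldPopulateHadoopClasspath(conf: Map[String, String]): Boolean =
    conf.getOrElse("spark.yarn.includeClusterHadoopClasspath", "true").toBoolean
}
```

Inlining the lookup avoids the single-use `isPopulateHadoopClasspath` val, and the default of true keeps today's behavior unless the user opts out.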