[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-52146688 QA tests have started for PR 1321. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18523/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-52146322 Jenkins, retest this please.
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-52146296 Now that I think about it more, LGTM.
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-52146302 :)
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-49002113 I think it makes more sense that you can't run any command than that certain commands happen to be runnable while there are no cluster resources. This sort of execution also puts more stress on the driver, and things like OutOfMemoryErrors on the driver are far more serious than on an Executor (for example, [this issue](https://groups.google.com/forum/#!msg/spark-users/eu9RJc3nQng/-T6wmcjMFiwJ)). My hypothesis is that this feature is rarely useful, and often leads to more confusion for users and potentially less stability.
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-48999161 When the cluster is busy and backlogged ...
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-48999051 @rxin is there a case where you think local execution will yield a relevant performance improvement? I don't see why shipping a task for a few milliseconds is a big deal. The main use case I see for this is people running `take` in a REPL. In that case the cluster scheduler is not backlogged, because they can't access the REPL at all until the prior command has finished anyway.
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-48280965 Maybe we should also solve the problem that local execution transfers the whole in-memory block when it should not (as a matter of fact, perhaps local execution should just bypass the in-memory data)?
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-48250513 Merged build finished.
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-48250515 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16384/
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-48242208 Merged build started.
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1321#issuecomment-48242188 Merged build triggered.
[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...
GitHub user aarondav opened a pull request: https://github.com/apache/spark/pull/1321

[RFC] Disable local execution of Spark jobs by default

Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst-case scenarios occur if the RDD is cached (guaranteed to load the whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead. Additionally, jobs that run locally in this manner do not show up in the web UI, so it is harder to track them or understand what is occurring.

This PR adds a flag that controls local execution, which is now turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark allowlocal

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1321.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1321

commit 164b08a67ff05ce422cb2ec382c5b08469bb1e4e
Author: Aaron Davidson
Date: 2014-07-07T20:52:12Z
[RFC] Disable local execution of Spark jobs by default (commit message identical to the description above)
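The pull request description's concern about a filter with high selectivity can be illustrated with a toy model in plain Python. This is not Spark code: `take_local` and its parameters are hypothetical names invented for this sketch, and it assumes "high selectivity" means that few elements pass the filter. It only shows that a lazily evaluated take(n) may have to scan most of a partition before it finds n matches, and in local execution that scan would happen on the driver.

```python
from itertools import islice


def take_local(partition, predicate, n):
    """Toy stand-in for evaluating take(n) over one partition on the driver.

    Hypothetical helper, not part of Spark's API. Returns the first n
    elements that pass `predicate`, plus the number of elements that had
    to be scanned to find them.
    """
    scanned = 0

    def counted():
        # Count every element pulled from the partition.
        nonlocal scanned
        for item in partition:
            scanned += 1
            yield item

    # Lazily filter, and stop as soon as n matches have been collected.
    matches = list(islice((x for x in counted() if predicate(x)), n))
    return matches, scanned


partition = range(100_000)

# Every element passes the filter: only one element is scanned.
print(take_local(partition, lambda x: True, 1))

# Only the last element passes: the whole partition is scanned
# to produce a single result.
print(take_local(partition, lambda x: x == 99_999, 1))
```

In this model the amount of work depends on where the matches sit in the partition rather than on n, which is one reading of why the description singles out selective or expensive filters.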