[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-52146688
  
QA tests have started for PR 1321. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18523/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-08-13 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-52146322
  
Jenkins, retest this please.





[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-08-13 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-52146296
  
Now that I think about it more, LGTM.







[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-08-13 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-52146302
  
:)





[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-15 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-49002113
  
I think it makes more sense for a command to fail outright than for certain 
commands to happen to be runnable while there are no cluster resources. This sort 
of execution also puts more stress on the driver, and things like 
OutOfMemoryErrors on the driver are far more serious than on an Executor (for 
example, [this 
issue](https://groups.google.com/forum/#!msg/spark-users/eu9RJc3nQng/-T6wmcjMFiwJ)).

My hypothesis is that this feature is rarely useful, and often leads to 
more confusion for users and potentially less stability.




[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-48999161
  
When the cluster is busy and backlogged ...




[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-15 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-48999051
  
@rxin is there a case where you think local execution will yield a relevant 
performance improvement? I don't see why shipping a task for a few milliseconds 
is a big deal. The main use case I see for this is people running `take` in a 
REPL. In that case the cluster scheduler is not backlogged, because they can't 
access the REPL at all until the prior command has finished anyway.




[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-08 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-48280965
  
Maybe we should also fix local execution so that it does not transfer the 
whole in-memory block (as a matter of fact, perhaps local execution should 
just bypass the in-memory data)?





[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-48250513
  
Merged build finished. 




[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-48250515
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16384/




[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-48242208
  
Merged build started. 




[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1321#issuecomment-48242188
  
 Merged build triggered. 




[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

2014-07-07 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/1321

[RFC] Disable local execution of Spark jobs by default

Currently, local execution of Spark jobs is only used by take(), and it can 
be problematic as it can load a significant amount of data onto the driver. The 
worst case scenarios occur if the RDD is cached (guaranteed to load whole 
partition), has very large elements, or the partition is just large and we 
apply a filter with high selectivity or computational overhead.
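
The cost difference between these cases can be illustrated with a toy model. This is plain Python, not Spark code: `CountingPartition`, `take_local`, and the `materialized` counter are hypothetical names introduced only for this sketch. The idea is that a lazily computed partition produces only as many elements as take() consumes, a cached partition is treated as an already materialized block that has to be loaded whole, and a highly selective filter forces a scan of much of the partition to find a few matches.

```python
from itertools import islice


class CountingPartition:
    """Toy stand-in for one partition; counts how many elements are materialized."""

    def __init__(self, size):
        self.size = size
        self.materialized = 0

    def iterator(self):
        # Lazily computed partition: elements are produced one at a time,
        # so the consumer decides how many are ever created.
        for i in range(self.size):
            self.materialized += 1
            yield i

    def cached_block(self):
        # Cached partition (assumption of this sketch): the whole block is
        # loaded at once, regardless of how many elements are needed.
        self.materialized += self.size
        return list(range(self.size))


def take_local(elements, n, predicate=None):
    """Return the first n elements, optionally only those matching predicate."""
    it = iter(elements)
    if predicate is not None:
        it = filter(predicate, it)
    return list(islice(it, n))
```

Taking 5 elements from `CountingPartition(10_000).iterator()` materializes 5 elements, taking them from `cached_block()` materializes all 10,000, and adding a predicate such as `lambda x: x % 1000 == 999` materializes 5,000 before the fifth match is found.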

Additionally, jobs that run locally in this manner do not show up in the 
web UI, which makes them harder to track and makes it harder to understand what is occurring.

This PR adds a flag that controls local execution, which is now turned OFF by 
default, with the intention of perhaps eventually removing this functionality 
altogether. Removing it now is a tougher proposition since it is part of the 
public runJob API. An alternative solution would be to limit the flag to 
take()/first() to avoid impacting any external users of this API, but such 
usage (or, at least, reliance upon the feature) is hopefully minimal.
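
The gating described above can be sketched as follows. This is a toy model in plain Python, not Spark's scheduler code: `SchedulerConfig`, `Job`, `should_run_locally`, `submit`, and the single-partition condition are all assumptions made for illustration. The PR text only states that a flag now controls local execution and that it defaults to off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerConfig:
    # Hypothetical flag mirroring the PR: local execution is OFF unless enabled.
    local_execution_enabled: bool = False


@dataclass(frozen=True)
class Job:
    allow_local: bool     # the caller (e.g. take()) asked for local execution
    num_partitions: int   # number of partitions this job needs to compute


def should_run_locally(config, job):
    # Assumption of this sketch: only single-partition jobs are candidates.
    return (
        config.local_execution_enabled
        and job.allow_local
        and job.num_partitions == 1
    )


def submit(config, job):
    """Return where the job would run in this toy model."""
    return "driver" if should_run_locally(config, job) else "cluster"
```

With the default `SchedulerConfig()`, a job that requests local execution is still sent to the cluster, so it would show up in the web UI like any other job. Only an explicit `SchedulerConfig(local_execution_enabled=True)` restores the old behavior in this model.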

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark allowlocal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1321.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1321


commit 164b08a67ff05ce422cb2ec382c5b08469bb1e4e
Author: Aaron Davidson 
Date:   2014-07-07T20:52:12Z

[RFC] Disable local execution of Spark jobs by default



