[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

aarondav Mon, 07 Jul 2014 14:12:14 -0700

GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/1321


    [RFC] Disable local execution of Spark jobs by default

    Currently, local execution of Spark jobs is only used by take(), and it can 
be problematic as it can load a significant amount of data onto the driver. The 
worst case scenarios occur if the RDD is cached (guaranteed to load whole 
partition), has very large elements, or the partition is just large and we 
apply a filter with high selectivity or computational overhead.
    
    Additionally, jobs that run locally in this manner do not show up in the 
web UI, and are thus harder to track or understand what is occurring.
    
    This PR adds a flag to disable local execution, which is turned OFF by 
default, with the intention of perhaps eventually removing this functionality 
altogether. Removing it now is a tougher proposition since it is part of the 
public runJob API. An alternative solution would be to limit the flag to 
take()/first() to avoid impacting any external users of this API, but such 
usage (or, at least, reliance upon the feature) is hopefully minimal.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark allowlocal

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1321.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1321
    
----
commit 164b08a67ff05ce422cb2ec382c5b08469bb1e4e
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-07-07T20:52:12Z

    [RFC] Disable local execution of Spark jobs by default
    
    Currently, local execution of Spark jobs is only used by take(), and it can
    be problematic as it can load a significant amount of data onto the driver.
    The worst case scenarios occur if the RDD is cached (guaranteed to load 
whole
    partition), has very large elements, or the partition is just large and we
    apply a filter with high selectivity or computational overhead.
    
    Additionally, jobs that run locally in this manner do not show up in the 
web UI,
    and are thus harder to track or understand what is occurring.
    
    This PR adds a flag to disable local execution, which is turned OFF by 
default, with
    the intention of perhaps eventually removing this functionality altogether. 
Removing it
    now is a tougher proposition since it is part of the public runJob API. An 
alternative
    solution would be to limit the flag to take()/first() to avoid impacting 
any external
    users of this API, but such usage (or at least, reliance upon the feature) 
is hopefully
    minimal.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] Disable local execution of Spark jobs by...

Reply via email to