[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-12-11 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-66710155
  
Agreed that there's probably not a ton that's immediately tunable.  But 
someone looking to "make it faster" could read this section, realize that they 
have bad locality, and move their HDFS and Spark workers closer together as a 
result.

I view this page as part prescriptive and part informative, and this 
section is definitely more on the informative side.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-12-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2519





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-12-10 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-66540898
  
Hey @ash211, I'm going to pull this in; thanks for working on it. One thing 
I do wonder is whether there are more actionable takeaways from this for users. 
In my experience the defaults are usually just fine, and it's not clear to me 
when users would need to tune this.





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-11-14 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-63024861
  
In the absence of feedback about the above questions and in an effort to 
clarify this at least somewhat in the docs, I think we should merge this 
docs-only PR as-is for the Spark 1.2.0 release.  We can always extend the docs 
later with clarifications if needed.

@pwendell would you please merge?





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56920607
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20842/consoleFull) for PR 2519 at commit [`44cff28`](https://github.com/apache/spark/commit/44cff28f183d5ba85d9395dda699faa137cad377).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56920613
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20842/





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-25 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56917577
  
My recent commits address @pwendell's comments, but I'd like to include an 
answer to my first two bullet points from the summary before merging:

- What's the difference between NO_PREF and ANY? I understand the 
implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL? I would think it'd be better to 
schedule rack-local tasks ahead of no preference if you could only do one or 
the other. Is the idea to wait longer and hope for the rack-local tasks to turn 
into node-local or better?





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56917562
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20842/consoleFull) for PR 2519 at commit [`44cff28`](https://github.com/apache/spark/commit/44cff28f183d5ba85d9395dda699faa137cad377).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2519#discussion_r18014202
  
--- Diff: docs/tuning.md ---
@@ -247,6 +247,39 @@ Spark prints the serialized size of each task on the master, so you can look at
 decide whether your tasks are too large; in general tasks larger than about 20 KB are probably
 worth optimizing.
 
+## Data Locality
+
+One of the most important principles of distributed computing is data locality.  If data and the
--- End diff --

It might be good to say something more like "Data locality can have a major 
impact on the performance of Spark jobs."





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2519#discussion_r18014164
  
--- Diff: docs/tuning.md ---
@@ -247,6 +247,39 @@ Spark prints the serialized size of each task on the master, so you can look at
 decide whether your tasks are too large; in general tasks larger than about 20 KB are probably
 worth optimizing.
 
+## Data Locality
+
+One of the most important principles of distributed computing is data locality.  If data and the
+code that operates on it are together then computation tends to be fast.  But if code and data are
+separated, one must move to the other.  Typically it is faster to ship serialized code from place to
+place than a chunk of data because code size is much smaller than data.  Spark builds its scheduling
+around this general principle of data locality.
+
+Data locality is how close data is to the code processing it.  There are several levels of
+locality based on the data's current location.  In order from closest to farthest:
+
+- `PROCESS_LOCAL` data is in the same JVM as the running code.  This is the best locality
+  possible
+- `NODE_LOCAL` data is on the same node.  Examples might be in HDFS on the same node, or in
+  another executor on the same node.  This is a little slower than `PROCESS_LOCAL` because the data
+  has to travel between processes
+- `NO_PREF` data is accessed equally quickly from anywhere and has no locality preference
+- `RACK_LOCAL` data is on the same rack of servers.  Data is on a different server on the same rack
+  so needs to be sent over the network, typically through a single switch
+- `ANY` data is elsewhere on the network and not in the same rack
+
+Spark prefers to schedule all tasks at the best locality level, but this is not always possible.  In
+situations where there is no unprocessed data on any idle executor, Spark switches to lower locality
+levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same
+server, or b) immediately start a new task in a farther away place that requires moving data there.
+
+What Spark typically does is wait a bit in the hopes that a busy CPU frees up.  Once that timeout
+expires, it starts moving the data from far away to the free CPU.  The wait timeout for fallback
--- End diff --

Here I would link to the configuration page instead of enumerating the 
configs. We try not to keep two copies of things like this in the docs; 
otherwise people could forget to update one of them.
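
To make the wait-then-fall-back behavior in the quoted section concrete, here 
is a minimal Python sketch. It is not Spark's scheduler code: the `Locality` 
enum mirrors the level names from the diff, while `WAITS`, `allowed_level`, and 
the 3-second budgets are hypothetical names and values chosen for illustration. 
It assumes that each level with a wait budget must use up that budget before 
the scheduler may fall back to the next level, and that levels without a budget 
are passed through immediately.

```python
from enum import IntEnum


class Locality(IntEnum):
    """Locality levels in the order listed in the quoted diff (closest first)."""
    PROCESS_LOCAL = 0
    NODE_LOCAL = 1
    NO_PREF = 2
    RACK_LOCAL = 3
    ANY = 4


# Hypothetical per-level wait budgets in seconds (illustrative values only).
WAITS = {
    Locality.PROCESS_LOCAL: 3.0,
    Locality.NODE_LOCAL: 3.0,
    Locality.RACK_LOCAL: 3.0,
}


def allowed_level(seconds_waiting, waits=WAITS):
    """Return the loosest locality level permitted after waiting this long.

    Assumption: each level consumes its wait budget in turn; a level with no
    entry in `waits` has a budget of zero and is skipped right away.
    """
    remaining = seconds_waiting
    level = Locality.PROCESS_LOCAL
    while level < Locality.ANY:
        budget = waits.get(level, 0.0)
        if remaining < budget:
            break  # still inside this level's wait window
        remaining -= budget
        level = Locality(level + 1)
    return level
```

Under these assumed budgets, `allowed_level(4.0)` yields `Locality.NODE_LOCAL`: 
the first 3 seconds exhaust the `PROCESS_LOCAL` budget, and the remaining 
second is still inside the `NODE_LOCAL` window.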





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread rnowling
Github user rnowling commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56725894
  
I suggest moving NO_PREF to the end.  RACK_LOCAL should certainly be above 
it according to the sort order given in the paragraph above the list.  I think 
ANY should be above NO_PREF as well.





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56649704
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20746/consoleFull) for PR 2519 at commit [`20e0e31`](https://github.com/apache/spark/commit/20e0e31158fe0350b8f59617f2228a48c34274ef).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56649713
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20746/





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2519#issuecomment-56642802
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20746/consoleFull) for PR 2519 at commit [`20e0e31`](https://github.com/apache/spark/commit/20e0e31158fe0350b8f59617f2228a48c34274ef).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3526 Add section about data locality to ...

2014-09-24 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/2519

SPARK-3526 Add section about data locality to the tuning guide

cc @kayousterhout

I have a few outstanding questions from compiling this documentation:
- What's the difference between NO_PREF and ANY?  I understand the 
implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL?  I would think it'd be better to 
schedule rack-local tasks ahead of no preference if you could only do one or 
the other.  Is the idea to wait longer and hope for the rack-local tasks to 
turn into node-local or better?
- Will there be a datacenter-local locality level in the future?  Apache 
Cassandra for example has this level

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark SPARK-3526

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2519.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2519


commit 20e0e31158fe0350b8f59617f2228a48c34274ef
Author: Andrew Ash 
Date:   2014-09-24T08:50:07Z

SPARK-3526 Add section about data locality to the tuning guide



