[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1486 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57593579 w00t!
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57592942 Okay - gonna merge this. Glad it's in good shape now. Thanks @cmccabe for the contribution.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57592632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21177/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57592627

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21177/consoleFull) for PR 1486 at commit [`338d4f8`](https://github.com/apache/spark/commit/338d4f8fedd68b64a7fdfaf078afcc2623072501).

* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57588340

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21177/consoleFull) for PR 1486 at commit [`338d4f8`](https://github.com/apache/spark/commit/338d4f8fedd68b64a7fdfaf078afcc2623072501).

* This patch merges cleanly.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57588159 Jenkins, retest this please.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57531253 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21153/
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57529171 I just rebased on master and re-pushed. It looks like this merge conflict was caused by another change to the MimaExcludes file, just like the previous merge conflict.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57351505 @cmccabe if you look at the message here it is saying that it doesn't merge cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57246819 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21001/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57246815

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21001/consoleFull) for PR 1486 at commit [`dfab423`](https://github.com/apache/spark/commit/dfab423a9986032d35907389ea6dfa913d53a28e).

* This patch **passes** unit tests.
* This patch **does not** merge cleanly!
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57242587 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21004/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57242585

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21004/consoleFull) for PR 1486 at commit [`f99cb60`](https://github.com/apache/spark/commit/f99cb6041a088ebadc1a9fdbd2f99ce4d54075d4).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57242523

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21004/consoleFull) for PR 1486 at commit [`f99cb60`](https://github.com/apache/spark/commit/f99cb6041a088ebadc1a9fdbd2f99ce4d54075d4).

* This patch merges cleanly.
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57241921 Rebasing on master.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57237832

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21001/consoleFull) for PR 1486 at commit [`dfab423`](https://github.com/apache/spark/commit/dfab423a9986032d35907389ea6dfa913d53a28e).

* This patch **does not** merge cleanly!
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57237531 Jenkins, test this please.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57235482 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20996/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57235474

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20996/consoleFull) for PR 1486 at commit [`dfab423`](https://github.com/apache/spark/commit/dfab423a9986032d35907389ea6dfa913d53a28e).

* This patch **fails** unit tests.
* This patch **does not** merge cleanly!
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57225796 @cmccabe you'll need to up-merge this. I guess something changed over the weekend.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57225023

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20996/consoleFull) for PR 1486 at commit [`dfab423`](https://github.com/apache/spark/commit/dfab423a9986032d35907389ea6dfa913d53a28e).

* This patch **does not** merge cleanly!
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57068065

Hm, this exclusion might not work in the case that a class is changed to an interface. Maybe just also add the specific recommended exclusion here:

```
ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.scheduler.TaskLocation")
```

Once this passes tests LGTM.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57039931 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20896/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57039928

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20896/consoleFull) for PR 1486 at commit [`a9b70b0`](https://github.com/apache/spark/commit/a9b70b0f138b470f8519312cafb4dc8c630bf802).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57038176

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20896/consoleFull) for PR 1486 at commit [`a9b70b0`](https://github.com/apache/spark/commit/a9b70b0f138b470f8519312cafb4dc8c630bf802).

* This patch merges cleanly.
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57038092 Thanks, being able to run `./dev/mima` helps a lot. This latest one should work with mima.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57026822 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20881/
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r18110514

--- Diff: project/MimaExcludes.scala ---

```diff
@@ -39,7 +39,10 @@ object MimaExcludes {
           MimaBuild.excludeSparkPackage("graphx")
         ) ++
         MimaBuild.excludeSparkClass("mllib.linalg.Matrix") ++
-        MimaBuild.excludeSparkClass("mllib.linalg.Vector")
+        MimaBuild.excludeSparkClass("mllib.linalg.Vector") ++
+        Seq(
+          ProblemFilters.excludeSparkClass("org.apache.spark.scheduler.TaskLocation")
```

--- End diff -- this should be `MimaBuild.excludeSparkClass`
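For context, a hedged sketch of how the exclusion chain in `project/MimaExcludes.scala` might read after the reviewer's suggestion is applied. This is a build-definition fragment, not standalone code; the `scheduler.TaskLocation` argument form is an assumption modeled on the neighboring `mllib.linalg.*` entries, since `MimaBuild.excludeSparkClass` is shown taking class names relative to the `org.apache.spark` package:

```scala
// Fragment of the sbt build (project/MimaExcludes.scala), assuming the fix:
// MimaBuild.excludeSparkClass generates the needed MiMa ProblemFilters, so no
// hand-rolled Seq(...) of raw filter entries is required for TaskLocation.
MimaBuild.excludeSparkClass("mllib.linalg.Matrix") ++
MimaBuild.excludeSparkClass("mllib.linalg.Vector") ++
MimaBuild.excludeSparkClass("scheduler.TaskLocation")
```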
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57012765 This code has a compile error now. You can run this locally with `./dev/mima`.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56873516 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20817/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56873514

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20817/consoleFull) for PR 1486 at commit [`c6390f3`](https://github.com/apache/spark/commit/c6390f3c3f776e189f8919855a988eae03de8af9).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56873419

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20817/consoleFull) for PR 1486 at commit [`c6390f3`](https://github.com/apache/spark/commit/c6390f3c3f776e189f8919855a988eae03de8af9).

* This patch merges cleanly.
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56873120 I have pushed a new version that updates the MimaExcludes.scala file with `ProblemFilters.excludeSparkClass("org.apache.spark.scheduler.TaskLocation")`... hopefully that will take care of it. Is there an sbt target for running the mima check locally? I didn't see one.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56769831

@cmccabe this is still failing the MIMA checks:

```
[error] * declaration of class org.apache.spark.scheduler.TaskLocation has changed to interface org.apache.spark.scheduler.TaskLocation in new version; changing class to interface breaks client code
[error]   filter with: ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.scheduler.TaskLocation")
```
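The error quoted in this message is about a template-kind change: a concrete class compiled to a JVM class becoming a trait compiled to a JVM interface, which breaks already-compiled client code that instantiated or extended the class. Schematically, the shape of such a refactoring might look like the following sketch (the names `HostTaskLocation` and `HDFSCacheTaskLocation` are illustrative guesses motivated by the PR title, not confirmed Spark source):

```scala
// Hypothetical sketch of the kind of change MiMa rejects here: TaskLocation
// goes from a single class to a trait with concrete implementations, so a
// scheduler can distinguish plain host locality from HDFS-cached locality.
sealed trait TaskLocation {
  def host: String
}

// A task preferring any executor on the given host.
case class HostTaskLocation(host: String) extends TaskLocation

// A task preferring a host holding an HDFS-cached replica of its block.
case class HDFSCacheTaskLocation(host: String) extends TaskLocation
```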
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56763276 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20770/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56763272

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20770/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56763240

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/148/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `logInfo("Interrupting user class to stop.")`
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56759037

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20770/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).
* This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56759058

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/148/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).
* This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56758490

Jenkins, test this please.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56743271

Yes, let's file a follow-up JIRA to discuss a design that can take into account any kind of different replica location. This patch doesn't expose any new APIs - it's all internal to Spark - so we can easily fit it into a bigger design when that arrives.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56586230

Basically my feeling is not to block user-submitted patches on someone making a broader redesign if they are fairly isolated and only change internal APIs.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56585915

I am totally 100% in support of adding a general mechanism for this and exposing it as a public API based on URIs, and of pushing this general thing into the TaskSetManager etc. That's for sure what we need to do longer term. The idea here was just to do something less ambitious for this internal use case - and we explicitly didn't document it or make it external at all. I think once we see a few different cases doing this, it will be time for a more general public API.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56506066

@pwendell This is not HadoopRDD-specific functionality - it is a general requirement which can be leveraged by any RDD in Spark - and HadoopRDD currently happens to have a use case for it when DFS caching is used. The fact that a preferred location is currently a String might be the limitation here: extending it for a URI or whatever else will add overhead (including the current patch). For example: an RDD which pulls data from Tachyon or other distributed memory stores, loading data into accelerator cards and specifying process-local locality for the block, etc. are all uses of the same functionality, imo. If not addressed properly, when the next similar requirement comes along, either we will be rewriting this code or adding more surgical hacks along the same lines. If the expectation is that Spark won't need to support these other requirements [1], then we can definitely punt on doing a proper design change. Given this is not a user-facing change (right?), we can definitely take the current approach and replace it later, or do a more principled solution upfront. @kayousterhout @markhamstra @mateiz any thoughts, given this modifies TaskSetManager for the addition of this feature?

[1] which is unlikely given MLlib's rapid pace of development - it is fairly inevitable that we will need to support accelerator cards sooner rather than later, at least given the arc of our past efforts with ML on Spark.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56484097

@mridulm the proposal here was to avoid proposing a generalized/public API for these and instead do something simple/internal for the case of HadoopRDD. The underscore is not a valid character in a hostname, so we can use it safely and continue to support it going forward at low cost. This just piggy-backs on the existing support we already have for in-memory input blocks. I'd like to see us add a publicly documented, complete interface for specifying task locality levels like you said, and support them in a general way in the TaskSetManager. URIs could be good for this, or some other structured format. But that is a much more complicated proposition, and one that requires some design discussion. The purpose of this patch is to do something more surgical in the short term.
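The underscore-prefix trick pwendell describes can be sketched as a small standalone example. This is an illustrative sketch modeled on the patch's `TaskLocation` changes quoted in the review comments below, not the exact merged code; the class names here are simplified:

```scala
// Illustrative sketch of the tag-prefix encoding: "_" is not a legal
// hostname character (RFC 952 / RFC 1123), so a location string that
// starts with the tag can never be confused with a real hostname.
object TaskLocationSketch {
  val inMemoryLocationTag = "_hdfs_cache_"

  sealed trait Loc { def host: String }
  // Host that holds an HDFS-cached (in-memory) replica of the block.
  case class HdfsCached(host: String) extends Loc
  // Plain replica location, host-level locality only.
  case class PlainHost(host: String) extends Loc

  // Parse a preferred-location string of the kind HadoopRDD would emit.
  def parse(str: String): Loc =
    if (str.startsWith(inMemoryLocationTag))
      HdfsCached(str.stripPrefix(inMemoryLocationTag))
    else
      PlainHost(str)
}
```

The round trip works because the tag is prepended in `toString` and stripped in the parser, so code outside the scheduler can keep treating locations as plain strings.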
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56480392

Are we proposing to introduce HDFS caching tags/idioms directly into TaskSetManager in this PR? That does not look right. We need to generalize this so that any RDD can specify process/host (maybe rack also?) annotations. Once done, HadoopRDD can leverage that. Depending on the underscore not being in the name, etc. is fragile. One option would be to define our own URIs, with the default reverting to host only.
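mridulm's URI alternative could look something like the following hypothetical sketch. The scheme names (`hdfscache`, `executor`) are invented here purely for illustration; nothing like this exists in the patch, which uses the underscore prefix instead:

```scala
import java.net.URI

// Hypothetical URI-based location encoding, an alternative to the
// underscore-prefix tag: a bare string defaults to host-only locality,
// while a URI scheme carries the locality annotation explicitly.
object UriLocationSketch {
  sealed trait Loc { def host: String }
  case class HostOnly(host: String) extends Loc
  case class HdfsCached(host: String) extends Loc
  case class ExecutorLocal(host: String, executorId: String) extends Loc

  def parse(s: String): Loc =
    if (!s.contains("://")) HostOnly(s) // default reverts to host only
    else {
      val uri = new URI(s)
      uri.getScheme match {
        case "hdfscache" => HdfsCached(uri.getHost)
        case "executor"  => ExecutorLocal(uri.getHost, uri.getPath.stripPrefix("/"))
        case _           => HostOnly(uri.getHost)
      }
    }
}
```

The design trade-off being debated: this structured form is extensible to new locality kinds without string-sniffing, at the cost of touching the TaskSetManager's internal contract now rather than later.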
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56470414

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20678/
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56470411

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20678/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56467045

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20678/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).
* This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56465510

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20674/consoleFull) for PR 1486 at commit [`8f9c5d6`](https://github.com/apache/spark/commit/8f9c5d66d7a630ebfee64afee7fa922c22f838ee).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56465520

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20674/
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56463517

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20674/consoleFull) for PR 1486 at commit [`8f9c5d6`](https://github.com/apache/spark/commit/8f9c5d66d7a630ebfee64afee7fa922c22f838ee).
* This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886353

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```diff
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
```

--- End diff --

Sorry, I forgot about this one. I added a type annotation to SPLIT_INFO_REFLECTIONS.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886029

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)

-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
+    if (str.startsWith(in_memory_location_tag)) {
+      new HDFSCachedTaskLocation(str.substring(in_memory_location_tag.length))
```

--- End diff --

ok
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886024

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)

-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
```

--- End diff --

added
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17885924

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```diff
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
+    Some(new SplitInfoReflections)
+  } catch {
+    case e: Exception =>
+      logDebug("SplitLocationInfo and other new Hadoop classes are " +
+        "unavailable. Using the older Hadoop location info code.", e)
+      None
+  }
+
+  private[spark] def convertSplitLocationInfo(infos: Array[AnyRef]) :Seq[String] = {
```

--- End diff --

ok
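The guarded-reflection pattern in the diff above (look up classes that only exist in newer Hadoop versions once, and fall back to `None` when they are absent) can be sketched in isolation as follows. `ReflectionProbe` and `tryLoad` are hypothetical names for illustration; the real patch bundles the lookups into `SplitInfoReflections` and caches them in `SPLIT_INFO_REFLECTIONS`:

```scala
// Sketch of optional reflection: probe for a class that only exists in
// newer library versions, and expose the result as an Option so callers
// degrade gracefully instead of throwing at every call site.
object ReflectionProbe {
  // Hypothetical helper: Some(clazz) when the class is on the classpath,
  // None otherwise. Catches Exception, matching the patch's catch clause.
  def tryLoad(className: String): Option[Class[_]] =
    try {
      Some(Class.forName(className))
    } catch {
      case _: Exception => None
    }

  // Evaluated once on first use, like the cached reflection bundle above.
  lazy val newHadoopApiAvailable: Boolean =
    tryLoad("org.apache.hadoop.mapred.SplitLocationInfo").isDefined
}
```

Callers then branch on the `Option` (or the cached boolean) and fall back to the older host-only location code when the new classes are unavailable.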
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17885878

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
```

--- End diff --

I added JavaDoc here.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17885653

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
```

--- End diff --

ok
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17881709

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
```

--- End diff --

ok
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56135114

I only had a few minor comments about documentation while trying to do a quick read-through of this patch. No substantive comments.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769069

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)

-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
```

--- End diff --

The contract of this method is kinda sketchy -- taking in a "str" which is either a host name or a tag. Would you mind adding a bit of Javadoc to explain that this is what is happening?
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769055

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)

-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
+    if (str.startsWith(in_memory_location_tag)) {
+      new HDFSCachedTaskLocation(str.substring(in_memory_location_tag.length))
--- End diff --

nit: `str.stripPrefix(in_memory_location_tag)`
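The tagged-hostname scheme in the diff above can be sketched end to end. This is a self-contained approximation (class names are shortened placeholders; the tag value matches the diff), using the `stripPrefix` form the reviewer suggests:

```scala
// Sketch of the tagged-hostname scheme: HDFS-cached locations serialize as a
// reserved tag plus the hostname; plain host locations serialize as-is.
object TaskLocationSketch {
  val inMemoryLocationTag = "_hdfs_cache_"

  sealed trait Loc { def host: String }

  case class HDFSCacheLoc(host: String) extends Loc {
    override def toString: String = inMemoryLocationTag + host
  }

  case class HostLoc(host: String) extends Loc {
    override def toString: String = host
  }

  // Parse a serialized location, using stripPrefix as the review suggests.
  def parse(str: String): Loc =
    if (str.startsWith(inMemoryLocationTag)) HDFSCacheLoc(str.stripPrefix(inMemoryLocationTag))
    else HostLoc(str)
}
```

Round-tripping holds in this sketch: `parse(loc.toString) == loc` for both variants.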
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769053

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
--- End diff --

Also, nit: could you use camel case: `inMemoryLocationTag`, or all caps with underscores if you prefer.
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768989

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
+    Some(new SplitInfoReflections)
+  } catch {
+    case e: Exception =>
+      logDebug("SplitLocationInfo and other new Hadoop classes are " +
+        "unavailable. Using the older Hadoop location info code.", e)
+      None
+  }
+
+  private[spark] def convertSplitLocationInfo(infos: Array[AnyRef]) :Seq[String] = {
--- End diff --

nit: `): Seq[String]`
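The reflection pattern quoted above — resolve the optional Hadoop 2.x classes once, and fall back gracefully when they are absent — can be sketched without Hadoop on the classpath. `java.lang.String` stands in for the optional class, and all names here are illustrative, not Spark's:

```scala
// Sketch of reflection-with-fallback: eagerly resolve a class and method,
// and return None if either is missing (as on an older library version).
object ReflectionFallback {
  class Reflections(className: String, methodName: String) {
    val clazz: Class[_] = Class.forName(className) // throws ClassNotFoundException if absent
    val method = clazz.getMethod(methodName)       // throws NoSuchMethodException if absent
  }

  def lookup(className: String, methodName: String): Option[Reflections] =
    try {
      Some(new Reflections(className, methodName))
    } catch {
      case _: Exception => None // missing class/method: caller takes the old code path
    }
}
```

The key property is that the lookup happens once (in Spark's case, in a `val` on the companion object), so the per-split code only pattern-matches on the cached `Option`.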
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768962

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
--- End diff --

Would you mind beefing up the documentation here a bit? I am having trouble reading through and quickly finding out the difference between HostTaskLocation and ExecutorCacheTaskLocation. I guess the latter is exclusively used for the BlockManager cache, but it would be good to be explicit.
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768928

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
--- End diff --

Minor, but `override val` on something that exports the same parameter is kinda weird, I think this could be cleaned up just slightly by making TaskLocation a trait instead with a `def host: String`. Then this still works and is the sole implementation.
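The suggested refactor might look like the following sketch; the trait and case-class names here are placeholders, not the code that was merged:

```scala
// TaskLocation as a trait with an abstract `def host`: each case class then
// supplies `host` through its own constructor parameter, with no `override val`
// re-exporting a superclass constructor parameter.
sealed trait TaskLocationLike extends Serializable {
  def host: String
}

// The case-class parameter `host` is a val, which implements the trait's def.
case class ExecutorCacheTaskLoc(host: String, executorId: String) extends TaskLocationLike

case class HostTaskLoc(host: String) extends TaskLocationLike {
  override def toString: String = host
}
```

This works because in Scala a concrete `val` (including a case-class parameter) can implement an abstract `def` of the same name and type.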
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-5619

Yes, this appears to be an issue with our checker and adding an exclusion is fine for now. The class is private. Just had really minor comments and I can address them on merge if you want. This is looking good to me. Any other changes or is this good from your side?
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768506

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
--- End diff --

could you drop the prefixing `_` here to make it consistent with blockid? Having a trailing underscore seems sufficient to distinguish it from a real hostname.
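The RFC 952/1123 reasoning behind both tag variants can be spot-checked: hostname labels may contain only letters, digits, and hyphens, so a tag containing an underscore in any position cannot collide with a real hostname. The regex below is a common RFC 1123 approximation written for this note, not Spark code:

```scala
// Approximate RFC 1123 hostname validation: dot-separated labels of
// letters/digits with optional interior hyphens, label <= 63 chars,
// total length <= 253 chars.
object HostnameCheck {
  private val label = "[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
  private val hostname = s"$label(\\.$label)*".r

  def isValidHostname(s: String): Boolean =
    s.nonEmpty && s.length <= 253 && hostname.pattern.matcher(s).matches()
}
```

Under this check, both `_hdfs_cache_host1` and `hdfs_cache_host1` (the leading underscore dropped, the trailing one kept) are rejected as hostnames, which is the property the tag relies on.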
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768479

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
--- End diff --

should this be `HDFSCacheTaskLocation` to be consistent with `ExecutorCacheTaskLocation`?
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768467

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
--- End diff --

did you decide you'd prefer not to do this?
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56125277

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20563/consoleFull) for PR 1486 at commit [`d1f9fe3`](https://github.com/apache/spark/commit/d1f9fe36392ab18e36e8491cae4598e0063e59fa).

* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56120182

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20563/consoleFull) for PR 1486 at commit [`d1f9fe3`](https://github.com/apache/spark/commit/d1f9fe36392ab18e36e8491cae4598e0063e59fa).

* This patch merges cleanly.
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55966988

The "unit test failure" mentioned here seems to be coming from the binary compatibility checker. The text of the error is:

[error] * class org.apache.spark.scheduler.TaskLocation was concrete; is declared abstract in new version
[error]   filter with: ProblemFilters.exclude[AbstractClassProblem]("org.apache.spark.scheduler.TaskLocation")

This check seems too strict to me, since TaskLocation is a private[spark] class. It's never exposed to users and isn't part of any user-facing API. What should I do here? I could add an "ignore" for this, I suppose.
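For reference, the "ignore" being discussed is the one-line filter the checker itself suggests. Where exactly it lives in Spark's build (e.g. a MiMa excludes file) is an assumption, not stated in this thread; the exclusion itself is verbatim from the error message above:

```scala
// Hypothetical excerpt from a MiMa exclusion list: suppress the
// concrete-to-abstract change for the private[spark] TaskLocation class.
ProblemFilters.exclude[AbstractClassProblem]("org.apache.spark.scheduler.TaskLocation")
```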
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55966108

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20487/consoleFull) for PR 1486 at commit [`b95ccd7`](https://github.com/apache/spark/commit/b95ccd74e5a1e9a8094189ba2400e26adea551a1).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55957034

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20487/consoleFull) for PR 1486 at commit [`b95ccd7`](https://github.com/apache/spark/commit/b95ccd74e5a1e9a8094189ba2400e26adea551a1).

* This patch merges cleanly.
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17691734

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -181,8 +181,24 @@ private[spark] class TaskSetManager(
     }

     for (loc <- tasks(index).preferredLocations) {
-      for (execId <- loc.executorId) {
-        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
+      loc match {
+        case e : ExecutorCacheTaskLocation =>
--- End diff --

ok
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17691691

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -181,8 +181,24 @@ private[spark] class TaskSetManager(
     }

     for (loc <- tasks(index).preferredLocations) {
-      for (execId <- loc.executorId) {
-        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
+      loc match {
+        case e : ExecutorCacheTaskLocation =>
+          addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
+        case e : HDFSCachedTaskLocation => {
+          val exe = sched.getExecutorsAliveOnHost(loc.host)
+          exe match {
+            case Some(set) => {
+              for (e <- set) {
+                addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
+              }
+              logInfo("Pending task " + index + " has a cached location at " + e.host +
--- End diff --

ok
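The dispatch in the diff above — pin `ExecutorCacheTaskLocation` tasks to one executor, fan HDFS-cached locations out to every alive executor on the host, leave plain host preferences alone — can be sketched with the scheduler reduced to a plain map. Case-class names mirror the diff; the bookkeeping (immutable maps instead of `ArrayBuffer`s) is illustrative:

```scala
// Sketch: index one task's preferred locations into per-executor pending lists.
object PendingTaskIndexing {
  sealed trait Loc { def host: String }
  case class ExecutorCacheTaskLocation(host: String, executorId: String) extends Loc
  case class HDFSCachedTaskLocation(host: String) extends Loc
  case class HostTaskLocation(host: String) extends Loc

  def indexTask(
      index: Int,
      preferredLocations: Seq[Loc],
      executorsAliveOnHost: Map[String, Set[String]]): Map[String, Seq[Int]] = {
    var pendingTasksForExecutor = Map.empty[String, Seq[Int]]
    def addTo(execId: String): Unit =
      pendingTasksForExecutor +=
        (execId -> (pendingTasksForExecutor.getOrElse(execId, Seq.empty) :+ index))

    for (loc <- preferredLocations) loc match {
      // Cached in a specific executor's BlockManager: pin to that executor.
      case e: ExecutorCacheTaskLocation => addTo(e.executorId)
      // HDFS-cached replica: any alive executor on that host can benefit.
      case e: HDFSCachedTaskLocation =>
        executorsAliveOnHost.getOrElse(e.host, Set.empty).foreach(addTo)
      // Plain host preference: no executor-level entry.
      case _: HostTaskLocation => ()
    }
    pendingTasksForExecutor
  }
}
```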
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17691660

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -23,12 +23,35 @@ package org.apache.spark.scheduler
  * of preference will be executors on the same host if this is not possible.
  */
 private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_M_"
--- End diff --

ok
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17614113

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -23,12 +23,33 @@ package org.apache.spark.scheduler
  * of preference will be executors on the same host if this is not possible.
  */
 private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+sealed abstract class TaskLocation(val host: String) {
+}
+
+case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
+  extends TaskLocation(host) {
+}
+
+case class HDFSCachedTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = TaskLocation.IN_MEMORY_LOCATION_TAG + host
+}
+
+case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  val IN_MEMORY_LOCATION_TAG = "_M_"
--- End diff --

Yes, I meant hostnames
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55771271

On the visibility stuff, understood. I actually forgot the "old API" is still supported in newer versions of Hadoop. Otherwise, you could put this all in the new hadoop RDD.
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17612925

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -181,8 +181,24 @@ private[spark] class TaskSetManager(
     }

     for (loc <- tasks(index).preferredLocations) {
-      for (execId <- loc.executorId) {
-        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
+      loc match {
+        case e : ExecutorCacheTaskLocation =>
--- End diff --

Should be `case e: ExecutorCacheTaskLocation` - the spacing around colons is off in a bunch of places in this patch, so maybe do an inventory on it. There should be zero spaces before the colon and one space after.
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17612896

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -181,8 +181,24 @@ private[spark] class TaskSetManager(
     }

     for (loc <- tasks(index).preferredLocations) {
-      for (execId <- loc.executorId) {
-        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
+      loc match {
+        case e : ExecutorCacheTaskLocation =>
+          addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
+        case e : HDFSCachedTaskLocation => {
+          val exe = sched.getExecutorsAliveOnHost(loc.host)
+          exe match {
+            case Some(set) => {
+              for (e <- set) {
+                addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
+              }
+              logInfo("Pending task " + index + " has a cached location at " + e.host +
--- End diff --

not a big deal, but it's a bit nicer to use string interpolation in cases like this.
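The nit, illustrated: Scala's `s` interpolator replaces chained `+` concatenation and produces the same string. The method and variable names here are illustrative, not from the patch:

```scala
// Concatenation vs. interpolation for the log message in the diff above.
object Interpolation {
  def concatStyle(index: Int, host: String): String =
    "Pending task " + index + " has a cached location at " + host

  def interpolatedStyle(index: Int, host: String): String =
    s"Pending task $index has a cached location at $host"
}
```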
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17612851

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -23,12 +23,35 @@ package org.apache.spark.scheduler
  * of preference will be executors on the same host if this is not possible.
  */
 private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_M_"
--- End diff --

Could you make this lower case to match other cases where we do this? e.g. `hdfs_cache_`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55687848

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20359/consoleFull) for PR 1486 at commit [`0d10adb`](https://github.com/apache/spark/commit/0d10adbc9abad5ac4bfd1e2d8538e2daf37a2a98).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `sealed abstract class TaskLocation(val host: String) `
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-55683946

    [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20359/consoleFull) for PR 1486 at commit [`0d10adb`](https://github.com/apache/spark/commit/0d10adbc9abad5ac4bfd1e2d8538e2daf37a2a98).
    * This patch merges cleanly.
Github user cmccabe commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-55683469

    I can see why you'd like to reduce visibility, but I don't think it's possible here. In HadoopRDD, three new things are exposed with visibility private[spark]: SPLIT_INFO_REFLECTIONS, convertSplitLocations, and the SplitInfoReflections type. These are exposed for the benefit of NewHadoopRDD.

    As far as I can see, you can't just have "a single... function call with a narrow interface" because HadoopRDD is dealing with different types than NewHadoopRDD. HadoopRDD is dealing with HadoopPartition, which needs to be cast to org.apache.hadoop.mapred.InputSplitWithLocationInfo; NewHadoopRDD is dealing with org.apache.hadoop.mapreduce.InputSplit. You will need two separate code paths for this. Ultimately, the goal is to get an array of org.apache.hadoop.mapred.SplitLocationInfo, at which point a common function, HadoopRDD#convertSplitLocationInfo, can be called. That common function currently lives in HadoopRDD.scala.

    We always have the freedom to refactor this in the future, since it's not visible from outside Spark. I think it's very unlikely that any other code inside Spark will start calling these methods, since they're obviously tied to the very specific goal of extracting location information from Hadoop 2.x types. What do you guys think?
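The two-paths-into-one-conversion shape described above can be sketched without Hadoop on the classpath. This is an illustrative model, not the PR's code: `LocationSketch`, `FakeSplitLocationInfo`, and the `hdfs_cache_` tag are stand-in names; the real paths start from `InputSplitWithLocationInfo` and `InputSplit` respectively.

```scala
object LocationSketch {
  // Stand-in for org.apache.hadoop.mapred.SplitLocationInfo: just a
  // (hostname, cached-in-memory?) pair, since the real class needs Hadoop 2.x.
  case class FakeSplitLocationInfo(location: String, inMemory: Boolean)

  // The shared final step, analogous to HadoopRDD#convertSplitLocationInfo:
  // tag hosts holding an in-memory replica so the scheduler can prefer them.
  def convertSplitLocationInfo(infos: Seq[FakeSplitLocationInfo]): Seq[String] =
    infos.map { info =>
      if (info.inMemory) "hdfs_cache_" + info.location else info.location
    }

  // Old-API path: in the real code this starts from a HadoopPartition cast
  // to org.apache.hadoop.mapred.InputSplitWithLocationInfo.
  def oldApiPreferredLocations(infos: Seq[FakeSplitLocationInfo]): Seq[String] =
    convertSplitLocationInfo(infos)

  // New-API path: in the real code this starts from an
  // org.apache.hadoop.mapreduce.InputSplit.
  def newApiPreferredLocations(infos: Seq[FakeSplitLocationInfo]): Seq[String] =
    convertSplitLocationInfo(infos)
}
```

The point of the factoring is that only the type-specific "pre-processing" differs between the two RDDs; everything after the array of location infos is obtained goes through one function.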
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17579270

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    --- End diff --

    I commented about this below, but to summarize: the old and new RDDs use different types, and the code path is different. There is a common function, HadoopRDD#convertSplitLocationInfo, but the types need to be "pre-processed" a bit to get to the point where we can call it.
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578280

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
    @@ -23,12 +23,33 @@ package org.apache.spark.scheduler
      * of preference will be executors on the same host if this is not possible.
      */
     private[spark]
    -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
    -  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
    +sealed abstract class TaskLocation(val host: String) {
    +}
    +
    +case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
    +    extends TaskLocation(host) {
    +}
    +
    +case class HDFSCachedTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = TaskLocation.IN_MEMORY_LOCATION_TAG + host
    +}
    +
    +case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = host
    +}

     private[spark] object TaskLocation {
    -  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
    +  val IN_MEMORY_LOCATION_TAG = "_M_"
    --- End diff --

    Underscores are not allowed in hostnames. The hostname format was specified in RFC 952 and later updated in RFC 1123. Legend has it that the underscore was disallowed because a popular teletype unit at the time didn't have an underscore key, and the goal was interoperability. I assume your reference to URIs was a typo, since there are no URIs in this file. I'll lowercase the constants.
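The RFC argument above is easy to check mechanically: hostname labels may contain only letters, digits, and hyphens, so any tag containing "_" can never collide with a real hostname. The regex below is an illustrative simplification of the RFC 952/1123 grammar, not production validation code, and `HostnameSketch` is a hypothetical name.

```scala
object HostnameSketch {
  // One RFC 1123 label: starts and ends with a letter or digit,
  // may contain hyphens in the middle. Note: no underscore anywhere.
  private val label = "[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?"
  private val hostname = (label + "(\\." + label + ")*").r

  def isValidHostname(s: String): Boolean =
    s.nonEmpty && s.length <= 253 && hostname.pattern.matcher(s).matches()
}
```

Because an underscore-prefixed string can never pass this check, a prefix check on a location string is unambiguous: either it starts with the tag (cached replica) or it is a plain hostname.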
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578110

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    +    val inputSplitWithLocationInfo =
    +      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
    +    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
    +    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
    +    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
    +    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    +    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    +    val getLocation = splitLocationInfo.getMethod("getLocation")
    +  }
    +
    +  private[spark] val SPLIT_INFO_REFLECTIONS = try {
    +    Some(new SplitInfoReflections)
    +  } catch {
    +    case e: Exception =>
    +      logDebug("SplitLocationInfo and other new Hadoop classes are " +
    +        "unavailable. Using the older Hadoop location info code.", e)
    +      None
    +  }
    +
    +  private[spark] def convertSplitLocationInfo(infos : Array[AnyRef]) : Seq[String] = {
    +    val out = new ListBuffer[String]
    +    infos.foreach(loc => {
    --- End diff --

    k
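The diff above uses a probe-once reflection pattern: attempt to resolve classes and methods that only exist on newer Hadoop versions, and fall back if they are absent. A minimal sketch of that pattern, assuming no Hadoop jars on the classpath (`ReflectionSketch` is an illustrative name; the `Class.forName` targets match the PR's diff):

```scala
object ReflectionSketch {
  // Mirrors the shape of the PR's SplitInfoReflections: each val throws
  // (ClassNotFoundException / NoSuchMethodException) if the newer Hadoop
  // API is missing from the classpath.
  class SplitInfoReflections {
    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    val getLocation = splitLocationInfo.getMethod("getLocation")
  }

  // Probe exactly once; None means "newer location API unavailable, use the
  // older fallback path". On a classpath without Hadoop this is None.
  lazy val splitInfoReflections: Option[SplitInfoReflections] =
    try Some(new SplitInfoReflections)
    catch { case _: Exception => None }
}
```

Callers then branch on the `Option` rather than repeating the probe, which is why the PR caches the result in a single `SPLIT_INFO_REFLECTIONS` value.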
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578061

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    +    val inputSplitWithLocationInfo =
    +      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
    +    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
    +    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
    +    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
    +    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    +    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    +    val getLocation = splitLocationInfo.getMethod("getLocation")
    +  }
    +
    +  private[spark] val SPLIT_INFO_REFLECTIONS = try {
    +    Some(new SplitInfoReflections)
    +  } catch {
    +    case e: Exception =>
    +      logDebug("SplitLocationInfo and other new Hadoop classes are " +
    +        "unavailable. Using the older Hadoop location info code.", e)
    +      None
    +  }
    +
    +  private[spark] def convertSplitLocationInfo(infos : Array[AnyRef]) : Seq[String] = {
    --- End diff --

    ok
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578078

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    +    val inputSplitWithLocationInfo =
    +      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
    +    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
    +    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
    +    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
    +    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    +    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    +    val getLocation = splitLocationInfo.getMethod("getLocation")
    +  }
    +
    +  private[spark] val SPLIT_INFO_REFLECTIONS = try {
    +    Some(new SplitInfoReflections)
    +  } catch {
    +    case e: Exception =>
    +      logDebug("SplitLocationInfo and other new Hadoop classes are " +
    +        "unavailable. Using the older Hadoop location info code.", e)
    +      None
    +  }
    +
    +  private[spark] def convertSplitLocationInfo(infos : Array[AnyRef]) : Seq[String] = {
    +    val out = new ListBuffer[String]
    --- End diff --

    ok
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578041

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    +    val inputSplitWithLocationInfo =
    +      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
    +    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
    +    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
    +    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
    +    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    +    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    +    val getLocation = splitLocationInfo.getMethod("getLocation")
    +  }
    +
    +  private[spark] val SPLIT_INFO_REFLECTIONS = try {
    --- End diff --

    ok
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17577975

    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -208,8 +208,10 @@ abstract class RDD[T: ClassTag](
       }

       /**
    -   * Get the preferred locations of a partition (as hostnames), taking into account whether the
    +   * Get the preferred locations of a partition, taking into account whether the
        * RDD is checkpointed.
    +   * The strings returned here can be parsed into TaskLocation objects by
    --- End diff --

    I will leave it as-is for now.
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17577951

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
    @@ -23,12 +23,33 @@ package org.apache.spark.scheduler
      * of preference will be executors on the same host if this is not possible.
      */
     private[spark]
    -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
    -  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
    +sealed abstract class TaskLocation(val host: String) {
    +}
    +
    +case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
    +    extends TaskLocation(host) {
    +}
    +
    +case class HDFSCachedTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = TaskLocation.IN_MEMORY_LOCATION_TAG + host
    +}
    +
    +case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = host
    +}

     private[spark] object TaskLocation {
    -  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
    +  val IN_MEMORY_LOCATION_TAG = "_M_"
    +
    +  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)
    +
    +  def apply(host: String) = new HostTaskLocation(host)
    -  def apply(host: String) = new TaskLocation(host, None)
    +
    +  def fromString(str : String) = {
    --- End diff --

    ok
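The toString/fromString pairing in the diff above is the heart of the scheme: location strings must round-trip through the tag. A miniature, self-contained sketch of that contract, with simplified names (`Loc`, `HdfsCachedLoc`, `HostLoc`) and the lowercase `hdfs_cache_` tag from the review suggestion, which may differ from the merged code:

```scala
sealed abstract class Loc(val host: String)

// A host that holds an HDFS-cached (in-memory) replica; its string form
// carries the tag so the scheduler can recover the distinction later.
case class HdfsCachedLoc(override val host: String) extends Loc(host) {
  override def toString: String = Loc.inMemoryLocationTag + host
}

// A plain host preference; its string form is just the hostname.
case class HostLoc(override val host: String) extends Loc(host) {
  override def toString: String = host
}

object Loc {
  // Underscores cannot appear in hostnames (RFC 952 / RFC 1123),
  // so this prefix check is unambiguous.
  val inMemoryLocationTag = "hdfs_cache_"

  def fromString(str: String): Loc =
    if (str.startsWith(inMemoryLocationTag)) HdfsCachedLoc(str.stripPrefix(inMemoryLocationTag))
    else HostLoc(str)
}
```

The sealed hierarchy lets the scheduler pattern-match exhaustively on the three location kinds instead of inspecting an `Option[String]` as the old single-class design did.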
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17577914

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
    @@ -23,12 +23,33 @@ package org.apache.spark.scheduler
      * of preference will be executors on the same host if this is not possible.
      */
     private[spark]
    -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
    -  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
    +sealed abstract class TaskLocation(val host: String) {
    +}
    +
    +case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
    --- End diff --

    ok
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17577892

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -248,10 +250,22 @@ class HadoopRDD[K, V](
         new HadoopMapPartitionsWithSplitRDD(this, f, preservesPartitioning)
       }

    -  override def getPreferredLocations(split: Partition): Seq[String] = {
    -    // TODO: Filtering out "localhost" in case of file:// URLs
    -    val hadoopSplit = split.asInstanceOf[HadoopPartition]
    -    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
    +  override def getPreferredLocations(hsplit: Partition): Seq[String] = {
    --- End diff --

    ok
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-55638524

    Added a few more comments after thinking about this some more. As it stands, the current factoring opens up a bunch of things at `private[spark]` visibility. We always try to use the narrowest visibility possible, and I think it would be easy to change this to have only a single `private[spark]` function call with a narrow interface.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17561092

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
    @@ -23,12 +23,33 @@ package org.apache.spark.scheduler
      * of preference will be executors on the same host if this is not possible.
      */
     private[spark]
    -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
    -  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
    +sealed abstract class TaskLocation(val host: String) {
    +}
    +
    +case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
    +    extends TaskLocation(host) {
    +}
    +
    +case class HDFSCachedTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = TaskLocation.IN_MEMORY_LOCATION_TAG + host
    +}
    +
    +case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = host
    +}

     private[spark] object TaskLocation {
    -  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
    +  val IN_MEMORY_LOCATION_TAG = "_M_"
    --- End diff --

    To make this more consistent with `BlockId`, can you use lowercase prefixes, like `hdfs_cache_XXX`? Also, is `_` a reserved character in URI names? If so, it would be good to add a comment in the code pointing to evidence of that fact.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17560907

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    --- End diff --

    The way this is factored exposes a bunch of stuff at `private[spark]` visibility that is really specific to the internals of `HadoopRDD`. What about just adding a single static utility function in `HadoopRDD`:

    ```
    /** Return a correctly formatted set of location strings for a HadoopPartition. */
    private[spark] def getLocationPreferences(partition: HadoopPartition): Seq[String] = {
    }
    ```