[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1486 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57593579 w00t!
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57592942 Okay - gonna merge this. Glad it's in good shape now. Thanks @cmccabe for the contribution.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57592632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21177/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57592627

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21177/consoleFull) for PR 1486 at commit [`338d4f8`](https://github.com/apache/spark/commit/338d4f8fedd68b64a7fdfaf078afcc2623072501).

* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57588340

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21177/consoleFull) for PR 1486 at commit [`338d4f8`](https://github.com/apache/spark/commit/338d4f8fedd68b64a7fdfaf078afcc2623072501).

* This patch merges cleanly.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57588159 Jenkins, retest this please.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57531253 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21153/
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57529171 I just rebased on master and re-pushed. It looks like this merge conflict was caused by another change to the MimaExcludes file, just like the previous merge conflict.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57351505 @cmccabe if you look at the message here it is saying that it doesn't merge cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57246819 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21001/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57246815

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21001/consoleFull) for PR 1486 at commit [`dfab423`](https://github.com/apache/spark/commit/dfab423a9986032d35907389ea6dfa913d53a28e).

* This patch **passes** unit tests.
* This patch **does not** merge cleanly!
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57242587 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21004/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57242585

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21004/consoleFull) for PR 1486 at commit [`f99cb60`](https://github.com/apache/spark/commit/f99cb6041a088ebadc1a9fdbd2f99ce4d54075d4).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57242523

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21004/consoleFull) for PR 1486 at commit [`f99cb60`](https://github.com/apache/spark/commit/f99cb6041a088ebadc1a9fdbd2f99ce4d54075d4).

* This patch merges cleanly.
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57241921 Rebasing on master.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57237832

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21001/consoleFull) for PR 1486 at commit [`dfab423`](https://github.com/apache/spark/commit/dfab423a9986032d35907389ea6dfa913d53a28e).

* This patch **does not** merge cleanly!
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57237531 Jenkins, test this please.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57235482 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20996/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57235474

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20996/consoleFull) for PR 1486 at commit [`dfab423`](https://github.com/apache/spark/commit/dfab423a9986032d35907389ea6dfa913d53a28e).

* This patch **fails** unit tests.
* This patch **does not** merge cleanly!
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57225796 @cmccabe you'll need to up-merge this. I guess something changed over the weekend.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57225023

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20996/consoleFull) for PR 1486 at commit [`dfab423`](https://github.com/apache/spark/commit/dfab423a9986032d35907389ea6dfa913d53a28e).

* This patch **does not** merge cleanly!
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57068065

Hm, this exclusion might not work in the case that a class is changed to an interface. Maybe just also add the specific recommended exclusion here:

```
ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.scheduler.TaskLocation")
```

Once this passes tests LGTM.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57039931 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20896/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57039928

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20896/consoleFull) for PR 1486 at commit [`a9b70b0`](https://github.com/apache/spark/commit/a9b70b0f138b470f8519312cafb4dc8c630bf802).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57038176

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20896/consoleFull) for PR 1486 at commit [`a9b70b0`](https://github.com/apache/spark/commit/a9b70b0f138b470f8519312cafb4dc8c630bf802).

* This patch merges cleanly.
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57038092 Thanks, being able to run `./dev/mima` helps a lot. This latest one should work with mima.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57026822 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20881/
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r18110514

--- Diff: project/MimaExcludes.scala ---

```diff
@@ -39,7 +39,10 @@ object MimaExcludes {
           MimaBuild.excludeSparkPackage("graphx")
         ) ++
         MimaBuild.excludeSparkClass("mllib.linalg.Matrix") ++
-        MimaBuild.excludeSparkClass("mllib.linalg.Vector")
+        MimaBuild.excludeSparkClass("mllib.linalg.Vector") ++
+        Seq(
+          ProblemFilters.excludeSparkClass("org.apache.spark.scheduler.TaskLocation")
```

--- End diff -- this should be `MimaBuild.excludeSparkClass`
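For context, a hedged sketch of how the exclusion chain in `project/MimaExcludes.scala` might read after the reviewer's suggestion is applied. This is a build-definition fragment, not standalone code; the `scheduler.TaskLocation` argument form is an assumption modeled on the neighboring `mllib.linalg.*` entries, since `MimaBuild.excludeSparkClass` is shown taking class names relative to the `org.apache.spark` package:

```scala
// Fragment of the sbt build (project/MimaExcludes.scala), assuming the fix:
// MimaBuild.excludeSparkClass generates the needed MiMa ProblemFilters, so no
// hand-rolled Seq(...) of raw filter entries is required for TaskLocation.
MimaBuild.excludeSparkClass("mllib.linalg.Matrix") ++
MimaBuild.excludeSparkClass("mllib.linalg.Vector") ++
MimaBuild.excludeSparkClass("scheduler.TaskLocation")
```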
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-57012765 This code has a compile error now. You can run this locally with `./dev/mima`.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56873516 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20817/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56873514

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20817/consoleFull) for PR 1486 at commit [`c6390f3`](https://github.com/apache/spark/commit/c6390f3c3f776e189f8919855a988eae03de8af9).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56873419

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20817/consoleFull) for PR 1486 at commit [`c6390f3`](https://github.com/apache/spark/commit/c6390f3c3f776e189f8919855a988eae03de8af9).

* This patch merges cleanly.
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56873120 I have pushed a new version that updates the MimaExcludes.scala file with `ProblemFilters.excludeSparkClass("org.apache.spark.scheduler.TaskLocation")`... hopefully that will take care of it. Is there an sbt target for running the mima check locally? I didn't see one.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56769831

@cmccabe this is still failing the MIMA checks:

```
[error] * declaration of class org.apache.spark.scheduler.TaskLocation has changed to interface org.apache.spark.scheduler.TaskLocation in new version; changing class to interface breaks client code
[error]   filter with: ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.scheduler.TaskLocation")
```
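The error quoted in this message is about a template-kind change: a concrete class compiled to a JVM class becoming a trait compiled to a JVM interface, which breaks already-compiled client code that instantiated or extended the class. Schematically, the shape of such a refactoring might look like the following sketch (the names `HostTaskLocation` and `HDFSCacheTaskLocation` are illustrative guesses motivated by the PR title, not confirmed Spark source):

```scala
// Hypothetical sketch of the kind of change MiMa rejects here: TaskLocation
// goes from a single class to a trait with concrete implementations, so a
// scheduler can distinguish plain host locality from HDFS-cached locality.
sealed trait TaskLocation {
  def host: String
}

// A task preferring any executor on the given host.
case class HostTaskLocation(host: String) extends TaskLocation

// A task preferring a host holding an HDFS-cached replica of its block.
case class HDFSCacheTaskLocation(host: String) extends TaskLocation
```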
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56763276 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20770/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56763272

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20770/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56763240

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/148/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `logInfo("Interrupting user class to stop.")`
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56759037

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20770/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).
* This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56759058

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/148/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).
* This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56758490

Jenkins, test this please.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56743271

Yes, let's file a follow-up JIRA to discuss a design that can take into account any kind of different replica location. This patch doesn't expose any new APIs - it's all internal to Spark - so we can easily fit it into a bigger design when that arrives.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56586230

Basically my feeling is not to block user-submitted patches on someone making a broader redesign if they are fairly isolated and only change internal APIs.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56585915

I am totally 100% in support of adding a general mechanism for this and exposing it as a public API based on URIs, and of pushing this general thing into the TaskSetManager etc. That's for sure what we need to do longer term. The idea here was just to do something less ambitious for this internal use case - and we explicitly didn't document it or make it external at all. I think once we see a few different cases doing this, it will be time for a more general public API.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56506066

@pwendell This is not HadoopRDD-specific functionality - it is a general requirement which can be leveraged by any RDD in Spark - and HadoopRDD currently happens to have a use case for it when DFS caching is used. The fact that a preferred location is currently a String might be the limitation here: extending it for a URI or whatever else will add overhead (including the current patch). For example: an RDD which pulls data from Tachyon or other distributed memory stores, loading data into accelerator cards and specifying process-local locality for the block, etc. are all uses of the same functionality, imo. If not addressed properly, when the next similar requirement comes along, either we will be rewriting this code or adding more surgical hacks along the same lines. If the expectation is that Spark won't need to support these other requirements [1], then we can definitely punt on doing a proper design change. Given this is not a user-facing change (right?), we can definitely take the current approach and replace it later, or do a more principled solution upfront. @kayousterhout @markhamstra @mateiz any thoughts, given this modifies TaskSetManager for the addition of this feature?

[1] which is unlikely given MLlib's rapid pace of development - it is fairly inevitable that we will need to support accelerator cards sooner rather than later, at least given the arc of our past efforts with ML on Spark.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56484097

@mridulm the proposal here was to avoid proposing a generalized/public API for these and instead do something simple/internal for the case of HadoopRDD. The underscore is not a valid character in a hostname, so we can use it safely and continue to support it going forward at low cost. This just piggy-backs on the existing support we already have for in-memory input blocks. I'd like to see us add a publicly documented, complete interface for specifying task locality levels like you said, and support them in a general way in the TaskSetManager. URIs could be good for this, or some other structured format. But that is a much more complicated proposition, and one that requires some design discussion. The purpose of this patch is to do something more surgical in the short term.
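The underscore-prefix trick pwendell describes can be sketched as a small standalone example. This is an illustrative sketch modeled on the patch's `TaskLocation` changes quoted in the review comments below, not the exact merged code; the class names here are simplified:

```scala
// Illustrative sketch of the tag-prefix encoding: "_" is not a legal
// hostname character (RFC 952 / RFC 1123), so a location string that
// starts with the tag can never be confused with a real hostname.
object TaskLocationSketch {
  val inMemoryLocationTag = "_hdfs_cache_"

  sealed trait Loc { def host: String }
  // Host that holds an HDFS-cached (in-memory) replica of the block.
  case class HdfsCached(host: String) extends Loc
  // Plain replica location, host-level locality only.
  case class PlainHost(host: String) extends Loc

  // Parse a preferred-location string of the kind HadoopRDD would emit.
  def parse(str: String): Loc =
    if (str.startsWith(inMemoryLocationTag))
      HdfsCached(str.stripPrefix(inMemoryLocationTag))
    else
      PlainHost(str)
}
```

The round trip works because the tag is prepended in `toString` and stripped in the parser, so code outside the scheduler can keep treating locations as plain strings.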
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56480392

Are we proposing to introduce HDFS caching tags/idioms directly into TaskSetManager in this PR? That does not look right. We need to generalize this so that any RDD can specify process/host (maybe rack also?) annotations. Once done, HadoopRDD can leverage that. Depending on the underscore not being in the name, etc. is fragile. One option would be to define our own URIs, with the default reverting to host only.
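mridulm's URI alternative could look something like the following hypothetical sketch. The scheme names (`hdfscache`, `executor`) are invented here purely for illustration; nothing like this exists in the patch, which uses the underscore prefix instead:

```scala
import java.net.URI

// Hypothetical URI-based location encoding, an alternative to the
// underscore-prefix tag: a bare string defaults to host-only locality,
// while a URI scheme carries the locality annotation explicitly.
object UriLocationSketch {
  sealed trait Loc { def host: String }
  case class HostOnly(host: String) extends Loc
  case class HdfsCached(host: String) extends Loc
  case class ExecutorLocal(host: String, executorId: String) extends Loc

  def parse(s: String): Loc =
    if (!s.contains("://")) HostOnly(s) // default reverts to host only
    else {
      val uri = new URI(s)
      uri.getScheme match {
        case "hdfscache" => HdfsCached(uri.getHost)
        case "executor"  => ExecutorLocal(uri.getHost, uri.getPath.stripPrefix("/"))
        case _           => HostOnly(uri.getHost)
      }
    }
}
```

The design trade-off being debated: this structured form is extensible to new locality kinds without string-sniffing, at the cost of touching the TaskSetManager's internal contract now rather than later.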
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56470414

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20678/
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56470411

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20678/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56467045

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20678/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2).
* This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56465510

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20674/consoleFull) for PR 1486 at commit [`8f9c5d6`](https://github.com/apache/spark/commit/8f9c5d66d7a630ebfee64afee7fa922c22f838ee).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56465520

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20674/
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56463517

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20674/consoleFull) for PR 1486 at commit [`8f9c5d6`](https://github.com/apache/spark/commit/8f9c5d66d7a630ebfee64afee7fa922c22f838ee).
* This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886353

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```diff
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
```

--- End diff --

Sorry, I forgot about this one. I added a type annotation to SPLIT_INFO_REFLECTIONS.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886029

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)

-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
+    if (str.startsWith(in_memory_location_tag)) {
+      new HDFSCachedTaskLocation(str.substring(in_memory_location_tag.length))
```

--- End diff --

ok
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886024

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)

-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
```

--- End diff --

added
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17885924

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```diff
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
+    Some(new SplitInfoReflections)
+  } catch {
+    case e: Exception =>
+      logDebug("SplitLocationInfo and other new Hadoop classes are " +
+        "unavailable. Using the older Hadoop location info code.", e)
+      None
+  }
+
+  private[spark] def convertSplitLocationInfo(infos: Array[AnyRef]) :Seq[String] = {
```

--- End diff --

ok
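The guarded-reflection pattern in the diff above (look up classes that only exist in newer Hadoop versions once, and fall back to `None` when they are absent) can be sketched in isolation as follows. `ReflectionProbe` and `tryLoad` are hypothetical names for illustration; the real patch bundles the lookups into `SplitInfoReflections` and caches them in `SPLIT_INFO_REFLECTIONS`:

```scala
// Sketch of optional reflection: probe for a class that only exists in
// newer library versions, and expose the result as an Option so callers
// degrade gracefully instead of throwing at every call site.
object ReflectionProbe {
  // Hypothetical helper: Some(clazz) when the class is on the classpath,
  // None otherwise. Catches Exception, matching the patch's catch clause.
  def tryLoad(className: String): Option[Class[_]] =
    try {
      Some(Class.forName(className))
    } catch {
      case _: Exception => None
    }

  // Evaluated once on first use, like the cached reflection bundle above.
  lazy val newHadoopApiAvailable: Boolean =
    tryLoad("org.apache.hadoop.mapred.SplitLocationInfo").isDefined
}
```

Callers then branch on the `Option` (or the cached boolean) and fall back to the older host-only location code when the new classes are unavailable.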
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17885878

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
```

--- End diff --

I added JavaDoc here.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17885653

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
```

--- End diff --

ok
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17881709

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
```

--- End diff --

ok
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56135114

I only had a few minor comments about documentation while trying to do a quick read-through of this patch. No substantive comments.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769069

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---

```diff
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)

-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
```

--- End diff --

The contract of this method is kinda sketchy -- taking in a "str" which is either a host name or a tag. Would you mind adding a bit of Javadoc to explain that this is what is happening?
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769055

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)

-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
+    if (str.startsWith(in_memory_location_tag)) {
+      new HDFSCachedTaskLocation(str.substring(in_memory_location_tag.length))
--- End diff --

nit: `str.stripPrefix(in_memory_location_tag)`
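The tagged-hostname scheme in the diff above can be sketched end to end. This is a self-contained approximation (class names are shortened placeholders; the tag value matches the diff), using the `stripPrefix` form the reviewer suggests:

```scala
// Sketch of the tagged-hostname scheme: HDFS-cached locations serialize as a
// reserved tag plus the hostname; plain host locations serialize as-is.
object TaskLocationSketch {
  val inMemoryLocationTag = "_hdfs_cache_"

  sealed trait Loc { def host: String }

  case class HDFSCacheLoc(host: String) extends Loc {
    override def toString: String = inMemoryLocationTag + host
  }

  case class HostLoc(host: String) extends Loc {
    override def toString: String = host
  }

  // Parse a serialized location, using stripPrefix as the review suggests.
  def parse(str: String): Loc =
    if (str.startsWith(inMemoryLocationTag)) HDFSCacheLoc(str.stripPrefix(inMemoryLocationTag))
    else HostLoc(str)
}
```

Round-tripping holds in this sketch: `parse(loc.toString) == loc` for both variants.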
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769053

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
--- End diff --

Also, nit: could you use camel case: `inMemoryLocationTag`, or all caps with underscores if you prefer.
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768989

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
+    Some(new SplitInfoReflections)
+  } catch {
+    case e: Exception =>
+      logDebug("SplitLocationInfo and other new Hadoop classes are " +
+        "unavailable. Using the older Hadoop location info code.", e)
+      None
+  }
+
+  private[spark] def convertSplitLocationInfo(infos: Array[AnyRef]) :Seq[String] = {
--- End diff --

nit: `): Seq[String]`
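The reflection pattern quoted above — resolve the optional Hadoop 2.x classes once, and fall back gracefully when they are absent — can be sketched without Hadoop on the classpath. `java.lang.String` stands in for the optional class, and all names here are illustrative, not Spark's:

```scala
// Sketch of reflection-with-fallback: eagerly resolve a class and method,
// and return None if either is missing (as on an older library version).
object ReflectionFallback {
  class Reflections(className: String, methodName: String) {
    val clazz: Class[_] = Class.forName(className) // throws ClassNotFoundException if absent
    val method = clazz.getMethod(methodName)       // throws NoSuchMethodException if absent
  }

  def lookup(className: String, methodName: String): Option[Reflections] =
    try {
      Some(new Reflections(className, methodName))
    } catch {
      case _: Exception => None // missing class/method: caller takes the old code path
    }
}
```

The key property is that the lookup happens once (in Spark's case, in a `val` on the companion object), so the per-split code only pattern-matches on the cached `Option`.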
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768962

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
--- End diff --

Would you mind beefing up the documentation here a bit? I am having trouble reading through and quickly finding out the difference between HostTaskLocation and ExecutorCacheTaskLocation. I guess the latter is exclusively used for the BlockManager cache, but it would be good to be explicit.
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768928

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
--- End diff --

Minor, but `override val` on something that exports the same parameter is kinda weird, I think this could be cleaned up just slightly by making TaskLocation a trait instead with a `def host: String`. Then this still works and is the sole implementation.
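The suggested refactor might look like the following sketch; the trait and case-class names here are placeholders, not the code that was merged:

```scala
// TaskLocation as a trait with an abstract `def host`: each case class then
// supplies `host` through its own constructor parameter, with no `override val`
// re-exporting a superclass constructor parameter.
sealed trait TaskLocationLike extends Serializable {
  def host: String
}

// The case-class parameter `host` is a val, which implements the trait's def.
case class ExecutorCacheTaskLoc(host: String, executorId: String) extends TaskLocationLike

case class HostTaskLoc(host: String) extends TaskLocationLike {
  override def toString: String = host
}
```

This works because in Scala a concrete `val` (including a case-class parameter) can implement an abstract `def` of the same name and type.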
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-5619

Yes, this appears to be an issue with our checker and adding an exclusion is fine for now. The class is private. Just had really minor comments and I can address them on merge if you want. This is looking good to me. Any other changes or is this good from your side?
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768506

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
--- End diff --

could you drop the prefixing `_` here to make it consistent with blockid? Having a trailing underscore seems sufficient to distinguish it from a real hostname.
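The RFC 952/1123 reasoning behind both tag variants can be spot-checked: hostname labels may contain only letters, digits, and hyphens, so a tag containing an underscore in any position cannot collide with a real hostname. The regex below is a common RFC 1123 approximation written for this note, not Spark code:

```scala
// Approximate RFC 1123 hostname validation: dot-separated labels of
// letters/digits with optional interior hyphens, label <= 63 chars,
// total length <= 253 chars.
object HostnameCheck {
  private val label = "[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
  private val hostname = s"$label(\\.$label)*".r

  def isValidHostname(s: String): Boolean =
    s.nonEmpty && s.length <= 253 && hostname.pattern.matcher(s).matches()
}
```

Under this check, both `_hdfs_cache_host1` and `hdfs_cache_host1` (the leading underscore dropped, the trailing one kept) are rejected as hostnames, which is the property the tag relies on.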
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768479

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
--- End diff --

should this be `HDFSCacheTaskLocation` to be consistent with `ExecutorCacheTaskLocation`?
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768467

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
--- End diff --

did you decide you'd prefer not to do this?
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56125277

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20563/consoleFull) for PR 1486 at commit [`d1f9fe3`](https://github.com/apache/spark/commit/d1f9fe36392ab18e36e8491cae4598e0063e59fa).

* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56120182

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20563/consoleFull) for PR 1486 at commit [`d1f9fe3`](https://github.com/apache/spark/commit/d1f9fe36392ab18e36e8491cae4598e0063e59fa).

* This patch merges cleanly.
Github user cmccabe commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55966988

The "unit test failure" mentioned here seems to be coming from the binary compatibility checker. The text of the error is:

[error] * class org.apache.spark.scheduler.TaskLocation was concrete; is declared abstract in new version
[error]   filter with: ProblemFilters.exclude[AbstractClassProblem]("org.apache.spark.scheduler.TaskLocation")

This check seems too strict to me, since TaskLocation is a private[spark] class. It's never exposed to users and isn't part of any user-facing API. What should I do here? I could add an "ignore" for this, I suppose.
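For reference, the "ignore" being discussed is the one-line filter the checker itself suggests. Where exactly it lives in Spark's build (e.g. a MiMa excludes file) is an assumption, not stated in this thread; the exclusion itself is verbatim from the error message above:

```scala
// Hypothetical excerpt from a MiMa exclusion list: suppress the
// concrete-to-abstract change for the private[spark] TaskLocation class.
ProblemFilters.exclude[AbstractClassProblem]("org.apache.spark.scheduler.TaskLocation")
```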
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55966108

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20487/consoleFull) for PR 1486 at commit [`b95ccd7`](https://github.com/apache/spark/commit/b95ccd74e5a1e9a8094189ba2400e26adea551a1).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55957034

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20487/consoleFull) for PR 1486 at commit [`b95ccd7`](https://github.com/apache/spark/commit/b95ccd74e5a1e9a8094189ba2400e26adea551a1).

* This patch merges cleanly.
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17691734

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -181,8 +181,24 @@ private[spark] class TaskSetManager(
     }

     for (loc <- tasks(index).preferredLocations) {
-      for (execId <- loc.executorId) {
-        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
+      loc match {
+        case e : ExecutorCacheTaskLocation =>
--- End diff --

ok
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17691691

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -181,8 +181,24 @@ private[spark] class TaskSetManager(
     }

     for (loc <- tasks(index).preferredLocations) {
-      for (execId <- loc.executorId) {
-        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
+      loc match {
+        case e : ExecutorCacheTaskLocation =>
+          addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
+        case e : HDFSCachedTaskLocation => {
+          val exe = sched.getExecutorsAliveOnHost(loc.host)
+          exe match {
+            case Some(set) => {
+              for (e <- set) {
+                addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
+              }
+              logInfo("Pending task " + index + " has a cached location at " + e.host +
--- End diff --

ok
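The dispatch in the diff above — pin `ExecutorCacheTaskLocation` tasks to one executor, fan HDFS-cached locations out to every alive executor on the host, leave plain host preferences alone — can be sketched with the scheduler reduced to a plain map. Case-class names mirror the diff; the bookkeeping (immutable maps instead of `ArrayBuffer`s) is illustrative:

```scala
// Sketch: index one task's preferred locations into per-executor pending lists.
object PendingTaskIndexing {
  sealed trait Loc { def host: String }
  case class ExecutorCacheTaskLocation(host: String, executorId: String) extends Loc
  case class HDFSCachedTaskLocation(host: String) extends Loc
  case class HostTaskLocation(host: String) extends Loc

  def indexTask(
      index: Int,
      preferredLocations: Seq[Loc],
      executorsAliveOnHost: Map[String, Set[String]]): Map[String, Seq[Int]] = {
    var pendingTasksForExecutor = Map.empty[String, Seq[Int]]
    def addTo(execId: String): Unit =
      pendingTasksForExecutor +=
        (execId -> (pendingTasksForExecutor.getOrElse(execId, Seq.empty) :+ index))

    for (loc <- preferredLocations) loc match {
      // Cached in a specific executor's BlockManager: pin to that executor.
      case e: ExecutorCacheTaskLocation => addTo(e.executorId)
      // HDFS-cached replica: any alive executor on that host can benefit.
      case e: HDFSCachedTaskLocation =>
        executorsAliveOnHost.getOrElse(e.host, Set.empty).foreach(addTo)
      // Plain host preference: no executor-level entry.
      case _: HostTaskLocation => ()
    }
    pendingTasksForExecutor
  }
}
```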
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17691660

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -23,12 +23,35 @@ package org.apache.spark.scheduler
  * of preference will be executors on the same host if this is not possible.
  */
 private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_M_"
--- End diff --

ok
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17614113

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -23,12 +23,33 @@ package org.apache.spark.scheduler
  * of preference will be executors on the same host if this is not possible.
  */
 private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+sealed abstract class TaskLocation(val host: String) {
+}
+
+case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
+  extends TaskLocation(host) {
+}
+
+case class HDFSCachedTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = TaskLocation.IN_MEMORY_LOCATION_TAG + host
+}
+
+case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  val IN_MEMORY_LOCATION_TAG = "_M_"
--- End diff --

Yes, I meant hostnames
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55771271

On the visibility stuff, understood. I actually forgot the "old API" is still supported in newer versions of Hadoop. Otherwise, you could put this all in the new hadoop RDD.
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17612925

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -181,8 +181,24 @@ private[spark] class TaskSetManager(
     }

     for (loc <- tasks(index).preferredLocations) {
-      for (execId <- loc.executorId) {
-        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
+      loc match {
+        case e : ExecutorCacheTaskLocation =>
--- End diff --

Should be `case e: ExecutorCacheTaskLocation` - the spacing around colons is off in a bunch of places in this patch, so maybe do an inventory on it. There should be zero spaces before the colon and one space after.
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17612896

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -181,8 +181,24 @@ private[spark] class TaskSetManager(
     }

     for (loc <- tasks(index).preferredLocations) {
-      for (execId <- loc.executorId) {
-        addTo(pendingTasksForExecutor.getOrElseUpdate(execId, new ArrayBuffer))
+      loc match {
+        case e : ExecutorCacheTaskLocation =>
+          addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
+        case e : HDFSCachedTaskLocation => {
+          val exe = sched.getExecutorsAliveOnHost(loc.host)
+          exe match {
+            case Some(set) => {
+              for (e <- set) {
+                addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
+              }
+              logInfo("Pending task " + index + " has a cached location at " + e.host +
--- End diff --

not a big deal, but it's a bit nicer to use string interpolation in cases like this.
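The nit, illustrated: Scala's `s` interpolator replaces chained `+` concatenation and produces the same string. The method and variable names here are illustrative, not from the patch:

```scala
// Concatenation vs. interpolation for the log message in the diff above.
object Interpolation {
  def concatStyle(index: Int, host: String): String =
    "Pending task " + index + " has a cached location at " + host

  def interpolatedStyle(index: Int, host: String): String =
    s"Pending task $index has a cached location at $host"
}
```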
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17612851

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -23,12 +23,35 @@ package org.apache.spark.scheduler
  * of preference will be executors on the same host if this is not possible.
  */
 private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+  val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+  extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }

 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_M_"
--- End diff --

Could you make this lower case to match other cases where we do this? e.g. `hdfs_cache_`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-55687848

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20359/consoleFull) for PR 1486 at commit [`0d10adb`](https://github.com/apache/spark/commit/0d10adbc9abad5ac4bfd1e2d8538e2daf37a2a98).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `sealed abstract class TaskLocation(val host: String) `
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-55683946

    [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20359/consoleFull) for PR 1486 at commit [`0d10adb`](https://github.com/apache/spark/commit/0d10adbc9abad5ac4bfd1e2d8538e2daf37a2a98).
    * This patch merges cleanly.
Github user cmccabe commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-55683469

    I can see why you'd like to reduce visibility, but I don't think it's possible here. In HadoopRDD, three new things are exposed with visibility private[spark]: SPLIT_INFO_REFLECTIONS, convertSplitLocations, and the SplitInfoReflections type. These are exposed for the benefit of NewHadoopRDD.

    As far as I can see, you can't just have "a single... function call with a narrow interface" because HadoopRDD is dealing with different types than NewHadoopRDD. HadoopRDD is dealing with HadoopPartition, which needs to be cast to org.apache.hadoop.mapred.InputSplitWithLocationInfo; NewHadoopRDD is dealing with org.apache.hadoop.mapreduce.InputSplit. You will need two separate code paths for this. Ultimately, the goal is to get an array of org.apache.hadoop.mapred.SplitLocationInfo, at which point a common function, HadoopRDD#convertSplitLocationInfo, can be called. That common function currently lives in HadoopRDD.scala.

    We always have the freedom to refactor this in the future, since it's not visible from outside Spark. I think it's very unlikely that any other code inside Spark will start calling these methods, since they're obviously tied to the very specific goal of extracting location information from Hadoop 2.x types. What do you guys think?
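The two-paths-into-one-conversion shape described above can be sketched without Hadoop on the classpath. This is an illustrative model, not the PR's code: `LocationSketch`, `FakeSplitLocationInfo`, and the `hdfs_cache_` tag are stand-in names; the real paths start from `InputSplitWithLocationInfo` and `InputSplit` respectively.

```scala
object LocationSketch {
  // Stand-in for org.apache.hadoop.mapred.SplitLocationInfo: just a
  // (hostname, cached-in-memory?) pair, since the real class needs Hadoop 2.x.
  case class FakeSplitLocationInfo(location: String, inMemory: Boolean)

  // The shared final step, analogous to HadoopRDD#convertSplitLocationInfo:
  // tag hosts holding an in-memory replica so the scheduler can prefer them.
  def convertSplitLocationInfo(infos: Seq[FakeSplitLocationInfo]): Seq[String] =
    infos.map { info =>
      if (info.inMemory) "hdfs_cache_" + info.location else info.location
    }

  // Old-API path: in the real code this starts from a HadoopPartition cast
  // to org.apache.hadoop.mapred.InputSplitWithLocationInfo.
  def oldApiPreferredLocations(infos: Seq[FakeSplitLocationInfo]): Seq[String] =
    convertSplitLocationInfo(infos)

  // New-API path: in the real code this starts from an
  // org.apache.hadoop.mapreduce.InputSplit.
  def newApiPreferredLocations(infos: Seq[FakeSplitLocationInfo]): Seq[String] =
    convertSplitLocationInfo(infos)
}
```

The point of the factoring is that only the type-specific "pre-processing" differs between the two RDDs; everything after the array of location infos is obtained goes through one function.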
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17579270

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    --- End diff --

    I commented about this below, but to summarize: the old and new RDDs use different types, and the code path is different. There is a common function, HadoopRDD#convertSplitLocationInfo, but the types need to be "pre-processed" a bit to get to the point where we can call it.
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578280

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
    @@ -23,12 +23,33 @@ package org.apache.spark.scheduler
      * of preference will be executors on the same host if this is not possible.
      */
     private[spark]
    -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
    -  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
    +sealed abstract class TaskLocation(val host: String) {
    +}
    +
    +case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
    +    extends TaskLocation(host) {
    +}
    +
    +case class HDFSCachedTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = TaskLocation.IN_MEMORY_LOCATION_TAG + host
    +}
    +
    +case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = host
    +}

     private[spark] object TaskLocation {
    -  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
    +  val IN_MEMORY_LOCATION_TAG = "_M_"
    --- End diff --

    Underscores are not allowed in hostnames. The hostname format was specified in RFC 952 and later updated in RFC 1123. Legend has it that the underscore was disallowed because a popular teletype unit at the time didn't have an underscore key, and the goal was interoperability. I assume your reference to URIs was a typo, since there are no URIs in this file. I'll lowercase the constants.
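The RFC argument above is easy to check mechanically: hostname labels may contain only letters, digits, and hyphens, so any tag containing "_" can never collide with a real hostname. The regex below is an illustrative simplification of the RFC 952/1123 grammar, not production validation code, and `HostnameSketch` is a hypothetical name.

```scala
object HostnameSketch {
  // One RFC 1123 label: starts and ends with a letter or digit,
  // may contain hyphens in the middle. Note: no underscore anywhere.
  private val label = "[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?"
  private val hostname = (label + "(\\." + label + ")*").r

  def isValidHostname(s: String): Boolean =
    s.nonEmpty && s.length <= 253 && hostname.pattern.matcher(s).matches()
}
```

Because an underscore-prefixed string can never pass this check, a prefix check on a location string is unambiguous: either it starts with the tag (cached replica) or it is a plain hostname.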
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578110

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    +    val inputSplitWithLocationInfo =
    +      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
    +    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
    +    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
    +    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
    +    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    +    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    +    val getLocation = splitLocationInfo.getMethod("getLocation")
    +  }
    +
    +  private[spark] val SPLIT_INFO_REFLECTIONS = try {
    +    Some(new SplitInfoReflections)
    +  } catch {
    +    case e: Exception =>
    +      logDebug("SplitLocationInfo and other new Hadoop classes are " +
    +        "unavailable. Using the older Hadoop location info code.", e)
    +      None
    +  }
    +
    +  private[spark] def convertSplitLocationInfo(infos : Array[AnyRef]) : Seq[String] = {
    +    val out = new ListBuffer[String]
    +    infos.foreach(loc => {
    --- End diff --

    k
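The diff above uses a probe-once reflection pattern: attempt to resolve classes and methods that only exist on newer Hadoop versions, and fall back if they are absent. A minimal sketch of that pattern, assuming no Hadoop jars on the classpath (`ReflectionSketch` is an illustrative name; the `Class.forName` targets match the PR's diff):

```scala
object ReflectionSketch {
  // Mirrors the shape of the PR's SplitInfoReflections: each val throws
  // (ClassNotFoundException / NoSuchMethodException) if the newer Hadoop
  // API is missing from the classpath.
  class SplitInfoReflections {
    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    val getLocation = splitLocationInfo.getMethod("getLocation")
  }

  // Probe exactly once; None means "newer location API unavailable, use the
  // older fallback path". On a classpath without Hadoop this is None.
  lazy val splitInfoReflections: Option[SplitInfoReflections] =
    try Some(new SplitInfoReflections)
    catch { case _: Exception => None }
}
```

Callers then branch on the `Option` rather than repeating the probe, which is why the PR caches the result in a single `SPLIT_INFO_REFLECTIONS` value.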
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578061

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    +    val inputSplitWithLocationInfo =
    +      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
    +    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
    +    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
    +    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
    +    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    +    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    +    val getLocation = splitLocationInfo.getMethod("getLocation")
    +  }
    +
    +  private[spark] val SPLIT_INFO_REFLECTIONS = try {
    +    Some(new SplitInfoReflections)
    +  } catch {
    +    case e: Exception =>
    +      logDebug("SplitLocationInfo and other new Hadoop classes are " +
    +        "unavailable. Using the older Hadoop location info code.", e)
    +      None
    +  }
    +
    +  private[spark] def convertSplitLocationInfo(infos : Array[AnyRef]) : Seq[String] = {
    --- End diff --

    ok
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578078

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    +    val inputSplitWithLocationInfo =
    +      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
    +    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
    +    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
    +    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
    +    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    +    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    +    val getLocation = splitLocationInfo.getMethod("getLocation")
    +  }
    +
    +  private[spark] val SPLIT_INFO_REFLECTIONS = try {
    +    Some(new SplitInfoReflections)
    +  } catch {
    +    case e: Exception =>
    +      logDebug("SplitLocationInfo and other new Hadoop classes are " +
    +        "unavailable. Using the older Hadoop location info code.", e)
    +      None
    +  }
    +
    +  private[spark] def convertSplitLocationInfo(infos : Array[AnyRef]) : Seq[String] = {
    +    val out = new ListBuffer[String]
    --- End diff --

    ok
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17578041

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    +    val inputSplitWithLocationInfo =
    +      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
    +    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
    +    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
    +    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
    +    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
    +    val isInMemory = splitLocationInfo.getMethod("isInMemory")
    +    val getLocation = splitLocationInfo.getMethod("getLocation")
    +  }
    +
    +  private[spark] val SPLIT_INFO_REFLECTIONS = try {
    --- End diff --

    ok
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17577975

    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -208,8 +208,10 @@ abstract class RDD[T: ClassTag](
       }

       /**
    -   * Get the preferred locations of a partition (as hostnames), taking into account whether the
    +   * Get the preferred locations of a partition, taking into account whether the
        * RDD is checkpointed.
    +   * The strings returned here can be parsed into TaskLocation objects by
    --- End diff --

    I will leave it as-is for now.
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17577951

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
    @@ -23,12 +23,33 @@ package org.apache.spark.scheduler
      * of preference will be executors on the same host if this is not possible.
      */
     private[spark]
    -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
    -  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
    +sealed abstract class TaskLocation(val host: String) {
    +}
    +
    +case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
    +    extends TaskLocation(host) {
    +}
    +
    +case class HDFSCachedTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = TaskLocation.IN_MEMORY_LOCATION_TAG + host
    +}
    +
    +case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = host
    +}

     private[spark] object TaskLocation {
    -  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
    +  val IN_MEMORY_LOCATION_TAG = "_M_"
    +
    +  def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId)
    +
    +  def apply(host: String) = new HostTaskLocation(host)
    -  def apply(host: String) = new TaskLocation(host, None)
    +
    +  def fromString(str : String) = {
    --- End diff --

    ok
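The toString/fromString pairing in the diff above is the heart of the scheme: location strings must round-trip through the tag. A miniature, self-contained sketch of that contract, with simplified names (`Loc`, `HdfsCachedLoc`, `HostLoc`) and the lowercase `hdfs_cache_` tag from the review suggestion, which may differ from the merged code:

```scala
sealed abstract class Loc(val host: String)

// A host that holds an HDFS-cached (in-memory) replica; its string form
// carries the tag so the scheduler can recover the distinction later.
case class HdfsCachedLoc(override val host: String) extends Loc(host) {
  override def toString: String = Loc.inMemoryLocationTag + host
}

// A plain host preference; its string form is just the hostname.
case class HostLoc(override val host: String) extends Loc(host) {
  override def toString: String = host
}

object Loc {
  // Underscores cannot appear in hostnames (RFC 952 / RFC 1123),
  // so this prefix check is unambiguous.
  val inMemoryLocationTag = "hdfs_cache_"

  def fromString(str: String): Loc =
    if (str.startsWith(inMemoryLocationTag)) HdfsCachedLoc(str.stripPrefix(inMemoryLocationTag))
    else HostLoc(str)
}
```

The sealed hierarchy lets the scheduler pattern-match exhaustively on the three location kinds instead of inspecting an `Option[String]` as the old single-class design did.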
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17577914

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
    @@ -23,12 +23,33 @@ package org.apache.spark.scheduler
      * of preference will be executors on the same host if this is not possible.
      */
     private[spark]
    -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
    -  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
    +sealed abstract class TaskLocation(val host: String) {
    +}
    +
    +case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
    --- End diff --

    ok
Github user cmccabe commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17577892

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -248,10 +250,22 @@ class HadoopRDD[K, V](
         new HadoopMapPartitionsWithSplitRDD(this, f, preservesPartitioning)
       }

    -  override def getPreferredLocations(split: Partition): Seq[String] = {
    -    // TODO: Filtering out "localhost" in case of file:// URLs
    -    val hadoopSplit = split.asInstanceOf[HadoopPartition]
    -    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
    +  override def getPreferredLocations(hsplit: Partition): Seq[String] = {
    --- End diff --

    ok
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-55638524

    Added a few more comments after thinking about this some more. As it stands, the current factoring opens up a bunch of things at `private[spark]` visibility. We always try to use the narrowest visibility possible, and I think it would be easy to change this to have only a single `private[spark]` function call with a narrow interface.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17561092

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
    @@ -23,12 +23,33 @@ package org.apache.spark.scheduler
      * of preference will be executors on the same host if this is not possible.
      */
     private[spark]
    -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
    -  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
    +sealed abstract class TaskLocation(val host: String) {
    +}
    +
    +case class ExecutorCacheTaskLocation(override val host: String, val executorId: String)
    +    extends TaskLocation(host) {
    +}
    +
    +case class HDFSCachedTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = TaskLocation.IN_MEMORY_LOCATION_TAG + host
    +}
    +
    +case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
    +  override def toString = host
    +}

     private[spark] object TaskLocation {
    -  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
    +  val IN_MEMORY_LOCATION_TAG = "_M_"
    --- End diff --

    To make this more consistent with `BlockId`, can you use lowercase prefixes, like `hdfs_cache_XXX`? Also, is `_` a reserved character in URI names? If so, it would be good to add a comment in the code pointing to evidence of that fact.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1486#discussion_r17560907

    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
         f(inputSplit, firstParent[T].iterator(split, context))
       }
     }
    +
    +  private[spark] class SplitInfoReflections {
    --- End diff --

    The way this is factored exposes a bunch of stuff at `private[spark]` visibility that is really specific to the internals of `HadoopRDD`. What about just adding a single static utility function in `HadoopRDD`:

    ```
    /** Return a correctly formatted set of location strings for a HadoopPartition. */
    private[spark] def getLocationPreferences(partition: HadoopPartition): Seq[String] = {
    }
    ```