[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12527#discussion_r60695719 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -131,4 +134,23 @@ class FileScanRDD( } override protected def getPartitions: Array[RDDPartition] = filePartitions.toArray + + override protected def getPreferredLocations(split: RDDPartition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files + +// Computes total number of bytes can be retrieved from each host. +val hostToNumBytes = mutable.HashMap.empty[String, Long] +files.foreach { file => + file.locations.filter(_ != "localhost").foreach { host => --- End diff -- We should filter them out. For a partition that doesn't have any preferred locations, it can be bundled with any other tasks and scheduled to any executor. But once it's marked with "localhost", delayed scheduling may be triggered because they have different host name as other tasks. Further more, "localhost" isn't a valid location for the `DAGScheduler` when deciding which executors to run the tasks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12527 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213253318 Merging this into master, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213252687 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213252688 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56639/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213252532 **[Test build #56639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56639/consoleFull)** for PR 12527 at commit [`2b36e97`](https://github.com/apache/spark/commit/2b36e97aafec3b11132426ffb550f2188c28f787). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213230661 lgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12527#discussion_r60686408 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -131,4 +134,23 @@ class FileScanRDD( } override protected def getPartitions: Array[RDDPartition] = filePartitions.toArray + + override protected def getPreferredLocations(split: RDDPartition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files + +// Computes total number of bytes can be retrieved from each host. +val hostToNumBytes = mutable.HashMap.empty[String, Long] +files.foreach { file => + file.locations.filter(_ != "localhost").foreach { host => --- End diff -- ok HadoopRDD and NewHadoopRDD are filtering locationgs set to localhost out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213225300 **[Test build #56639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56639/consoleFull)** for PR 12527 at commit [`2b36e97`](https://github.com/apache/spark/commit/2b36e97aafec3b11132426ffb550f2188c28f787). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213214152 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213214154 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56615/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213213921 **[Test build #56615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56615/consoleFull)** for PR 12527 at commit [`e0bfa3e`](https://github.com/apache/spark/commit/e0bfa3ed67620ed878e69e341deb102a2defc001). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213167863 **[Test build #56615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56615/consoleFull)** for PR 12527 at commit [`e0bfa3e`](https://github.com/apache/spark/commit/e0bfa3ed67620ed878e69e341deb102a2defc001). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213167317 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12527#discussion_r60643219 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -131,4 +134,23 @@ class FileScanRDD( } override protected def getPartitions: Array[RDDPartition] = filePartitions.toArray + + override protected def getPreferredLocations(split: RDDPartition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files + +// Computes total number of bytes can be retrieved from each host. +val hostToNumBytes = mutable.HashMap.empty[String, Long] +files.foreach { file => + file.locations.filter(_ != "localhost").foreach { host => --- End diff -- I am not sure we should filter it out. For that test, localhost is the location of files belonging to one of the input tables (the side that is not shuffled), right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213072256 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213072264 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56559/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213071922 **[Test build #56559 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56559/consoleFull)** for PR 12527 at commit [`e0bfa3e`](https://github.com/apache/spark/commit/e0bfa3ed67620ed878e69e341deb102a2defc001). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213037505 Just want to add a note. For that test case, we have a join that only shuffle one side of the input, so we have both preferred locations of original input files as well as preferred locations of map output files. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213034425 The root cause of the deadlock has been found. Essentially, we should prevent "localhost" to be returned as `FileScanRDD` preferred locations. Here's a detailed description of the whole processes: 1. The test case involves a shuffle, which results in a `ShuffleRowRDD`, whose preferred locations are determined by the locations of the block manager that serves corresponding map output blocks. In a word, in the case of local test, the only preferred location string is the IP of the block manager. 1. In the case of local testing, `FileScanRDD.preferredLocations` always returns "localhost". 1. As a result, task set `ts1` derived from the `ShuffleRowRDD` and task set `ts2` derived from the `FileScanRDD` have different locality preference. 1. After job submission, `DAGScheduler` first schedules `ts1`. While trying to schedule `ts2`, delayed scheduling is triggered because `ts1` and `ts2` have different preferred locations. By default, `DAGScheduler` waits for 3s before trying `ts2` again. 1. 3s is long enough for all tasks in `ts1` to finish. However, `LocalBackend` doesn't revive offers periodically like other scheduler backends. It only revives offer when tasks are submitted, finish, or fail. Thus `ts2` never gets an opportunity to be scheduled again, and the submitted job never finishes. The only factor that is not clear for now is how the number of buckets (which affects number of submitted tasks) interact with the above process. The fix for this issue is simple, just filter out all "localhost" in `FileScanRDD.preferredLocations()` since "localhost" doesn't make sense as a preferred executor location. Actually this is exactly the last step of what `NewHadoopRDD.preferredLocations()` does. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213025638 **[Test build #56559 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56559/consoleFull)** for PR 12527 at commit [`e0bfa3e`](https://github.com/apache/spark/commit/e0bfa3ed67620ed878e69e341deb102a2defc001). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212696111 Yea, agree. Should either prove this is an existing bug (so that it can be fixed in another PR), or fix it if it's a bug introduced by this change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12527#discussion_r60456008 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala --- @@ -301,8 +301,12 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet } test("only shuffle one side when 2 bucketed tables have different bucket keys") { -val bucketSpec1 = Some(BucketSpec(8, Seq("i"), Nil)) -val bucketSpec2 = Some(BucketSpec(8, Seq("j"), Nil)) +// !! ALERT !! +// +// Setting `numBuckets` of the following two `BucketSpec` to 8 causes a deadlock in DAGScheduler +// due to unknown reasons. Need investigation. --- End diff -- Maybe it exposes a bug in somewhere. Let's look at it together. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212517510 @liancheng We can't have a workaround in test case to hide a bug, should fix it before merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212465511 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212465518 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56356/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212464920 **[Test build #56356 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56356/consoleFull)** for PR 12527 at commit [`0358e33`](https://github.com/apache/spark/commit/0358e33c1a796b7875f932c4d4aee28db505f696). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class FakeBlockLocation(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212462180 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56354/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212462174 Build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212461507 **[Test build #56354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56354/consoleFull)** for PR 12153 at commit [`79beca6`](https://github.com/apache/spark/commit/79beca6ad09df76a39cc1a80eebed378e3b2fffc). * This patch passes all tests. * This patch **does not merge cleanly**. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212425913 @yhuai This one can probably avoid the deadlock since I worked around it [here][1]. Also check [this comment] for more information. Not sure whether we should merge this one since I haven't figured out whether the deadlock is caused bug(s) in this PR or in `DAGScheduler`. [1]: https://github.com/apache/spark/pull/12527/files#diff-75164dfeac2c1507ae828d2ba5529470R304 [2]: https://github.com/apache/spark/pull/12153#issuecomment-212424785 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212425185 **[Test build #56356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56356/consoleFull)** for PR 12527 at commit [`0358e33`](https://github.com/apache/spark/commit/0358e33c1a796b7875f932c4d4aee28db505f696). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212424940 I'm closing this one for #12527, which is a rebased version of this one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/12153 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212424785 All the timeout in the Jenkins builds were due to a deadlock in `DAGScheduler`, and can be steadily reproduced locally by running the following test case > BucketedReadSuite.only shuffle one side when 2 bucketed tables have different bucket keys. This test case creates two bucketed tables both with 8 buckets and then joins them. Reducing 8 to 5 eliminates the deadlock. But I haven't figured out the real reason behind the deadlock. The deadlock also disappears if I remove FileScanRDD.preferredLocations(). Maybe that too many tasks are scheduled to the same place and exhausted some thread-pool? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12527 [SPARK-14369][SQL] Locality support for FileScanRDD (This PR is a rebased version of PR #12153.) ## What changes were proposed in this pull request? This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts: 1. Block location lookup Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and convert all `FileStatus`es into `LocatedFileStatus`es. Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems. 2. Selecting preferred locations For each `FilePartition`, we pick up top 3 locations that containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs. ## How was this patch tested? Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14369-locality-rebased Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12527 commit 0358e33c1a796b7875f932c4d4aee28db505f696 Author: Cheng Lian Date: 2016-04-20T13:25:51Z Rebased PR #12153 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212408238 **[Test build #56354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56354/consoleFull)** for PR 12153 at commit [`79beca6`](https://github.com/apache/spark/commit/79beca6ad09df76a39cc1a80eebed378e3b2fffc). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212101265 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56234/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212101095 **[Test build #56234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56234/consoleFull)** for PR 12153 at commit [`86f1195`](https://github.com/apache/spark/commit/86f1195c3f114aaf3a01f9043302379c09f0d755). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class FakeBlockLocation(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212101258 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212043186 Seems it is hanging in BucketedReadSuite again... Let's look at it together and figure out what's wrong. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211988625 **[Test build #56234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56234/consoleFull)** for PR 12153 at commit [`86f1195`](https://github.com/apache/spark/commit/86f1195c3f114aaf3a01f9043302379c09f0d755). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r60174480 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -621,20 +621,40 @@ class HDFSFileCatalog( def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path) + private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus]) +extends Iterator[LocatedFileStatus] { + +override def hasNext: Boolean = iterator.hasNext + +override def next(): LocatedFileStatus = iterator.next() + } + private def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = { if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext) --- End diff -- Let's also have a test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r60174434 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -621,20 +621,40 @@ class HDFSFileCatalog( def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path) + private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus]) +extends Iterator[LocatedFileStatus] { + +override def hasNext: Boolean = iterator.hasNext + +override def next(): LocatedFileStatus = iterator.next() + } + private def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = { if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext) --- End diff -- Seems we also need to update the `listLeafFiles` that is called by `listLeafFilesInParallel`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211721311 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211721313 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56158/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211721217 **[Test build #56158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56158/consoleFull)** for PR 12153 at commit [`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211638453 **[Test build #56158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56158/consoleFull)** for PR 12153 at commit [`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211637249 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211594412 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56090/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211594408 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211594306 **[Test build #56090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56090/consoleFull)** for PR 12153 at commit [`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211487877 **[Test build #56090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56090/consoleFull)** for PR 12153 at commit [`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211446073 Seems that Jenkins is down... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211423901 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210521600 @liancheng Add more tests? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210192942 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210192944 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55834/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210192829 **[Test build #55834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55834/consoleFull)** for PR 12153 at commit [`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210093794 **[Test build #55834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55834/consoleFull)** for PR 12153 at commit [`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210092607 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59766576 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -624,20 +624,40 @@ class HDFSFileCatalog( def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path) + private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus]) --- End diff -- It's used [here][1]. `RemoteIterator` doesn't extend from `Iterator`. This implicit conversion helps to convert contents behind a `RemoteIterator` into an array. [1]: https://github.com/apache/spark/pull/12153/files#diff-40c347747af9101e7e9fee52fc4120b8R656 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210044469 **[Test build #55813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55813/consoleFull)** for PR 12153 at commit [`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210044529 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55813/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210044527 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210042698 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55812/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210042693 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210042553 **[Test build #55812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55812/consoleFull)** for PR 12153 at commit [`c372a90`](https://github.com/apache/spark/commit/c372a90f2fc03e116f095c7cd1dce534b47104cc). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59733711 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { + def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = { +rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG) + } + + // We have to search for the `FileScanRDD` along the RDD DAG since + // RDD.preferredLocations doesn't propagate along the DAG to the root RDD. + allRDDsInDAG(df2.rdd).collectFirst { +case f: FileScanRDD => f + } +} + +val partitions = fileScanRDD.partitions --- End diff -- If we expect them be packed in the same partition, let's add an `assert`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59733621 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { --- End diff -- How about we add the following test cases? * Every partition of the FileScanRDD has a single file. * There are some large files. So, there are some partitions and every such a partition has a part of a file. * If we are going to pick the top K locations that host the largest amount of data, let's also add tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59730560 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -624,20 +624,40 @@ class HDFSFileCatalog( def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path) + private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus]) --- End diff -- Why do we need this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-209916891 **[Test build #55813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55813/consoleFull)** for PR 12153 at commit [`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-209915502 **[Test build #55812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55812/consoleFull)** for PR 12153 at commit [`c372a90`](https://github.com/apache/spark/commit/c372a90f2fc03e116f095c7cd1dce534b47104cc). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59707908 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the index of the block that contains this `offset`. If no such block + // can be found, returns -1. + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { +blockLocations.indexWhere { b => + b.getOffset <= offset && offset < b.getOffset + b.getLength +} + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the block that contains this `offset`, and returns location hosts of + // that block. If no such block can be found, returns an empty array. + private def getBlockHosts(blockLocations: Array[BlockLocation], offset: Long): Array[String] = { --- End diff -- Makes sense, updated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59707891 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -86,4 +87,9 @@ class FileScanRDD( } override protected def getPartitions: Array[Partition] = filePartitions.toArray + + override protected def getPreferredLocations(split: Partition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files +if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations --- End diff -- Updated. @cloud-fan also suggested the same strategy. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59603636 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining -PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size) +// Finds out a list of location hosts where lives the first block that contains the +// starting point the file block starts at `offset` with length `size`. +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- We could do that in a separate PR (improve the locality with bounded complicity). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59603298 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the index of the block that contains this `offset`. If no such block + // can be found, returns -1. + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { +blockLocations.indexWhere { b => + b.getOffset <= offset && offset < b.getOffset + b.getLength +} + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the block that contains this `offset`, and returns location hosts of + // that block. If no such block can be found, returns an empty array. + private def getBlockHosts(blockLocations: Array[BlockLocation], offset: Long): Array[String] = { --- End diff -- Should we pick the hosts based on range that the host hold most of data in that range? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-209578767 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-209578771 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55713/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-209578718 **[Test build #55713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55713/consoleFull)** for PR 12153 at commit [`cef40bf`](https://github.com/apache/spark/commit/cef40bfb940a24e43ac01e4c521be6de51f5662d). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59598886 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -86,4 +87,9 @@ class FileScanRDD( } override protected def getPartitions: Array[Partition] = filePartitions.toArray + + override protected def getPreferredLocations(split: Partition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files +if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations --- End diff -- Should we pick the top 3 host that hold most of the data? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59595284 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the index of the block that contains this `offset`. If no such block + // can be found, returns -1. + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { --- End diff -- This one is a little bit tricky, would like to keep it here for readability. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59595123 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { --- End diff -- Yea, this is better, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59594996 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { + def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = { +rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG) + } + + // We have to search for the `FileScanRDD` along the RDD DAG since + // RDD.preferredLocations doesn't propagate along the DAG to the root RDD. + allRDDsInDAG(df2.rdd).collectFirst { +case f: FileScanRDD => f + } +} + +val partitions = fileScanRDD.partitions --- End diff -- Not necessarily, and probably only 1, since both files written are pretty small and will be merged into a single partition. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-209528671 Overall LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59577970 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the index of the block that contains this `offset`. If no such block + // can be found, returns -1. + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { --- End diff -- This method is only called once, should we inline it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59577383 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { + def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = { +rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG) + } + + // We have to search for the `FileScanRDD` along the RDD DAG since + // RDD.preferredLocations doesn't propagate along the DAG to the root RDD. + allRDDsInDAG(df2.rdd).collectFirst { +case f: FileScanRDD => f + } +} + +val partitions = fileScanRDD.partitions --- End diff -- do we always return 2 partitions? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59576310 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { --- End diff -- how about: ``` val fileScanRDD = df2.queryExecution.executedPlan.collectFirst { case scan: DataSourceScan if scan.rdd.isInstanceOf[FileScanRDD] => scan.rdd.asInstanceOf[FileScanRDD] } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-209464750 **[Test build #55713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55713/consoleFull)** for PR 12153 at commit [`cef40bf`](https://github.com/apache/spark/commit/cef40bfb940a24e43ac01e4c521be6de51f5662d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59303595 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -86,4 +87,20 @@ class FileScanRDD( } override protected def getPartitions: Array[Partition] = filePartitions.toArray + + override protected def getPreferredLocations(split: Partition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files +val partitionSize = files.map(_.length).sum --- End diff -- how about: ``` val hostToLength = Map[String, Long]() for (file <- files) { for (host <- file.hosts) { if (hostToLength.contains(host)) { hostToLength(host) += file.length } else { hostToLength(host) = file.length } } hostToLength.toSeq.sortBy(_._2).take(3).map(_._1) } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-208456408 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-208456399 **[Test build #55529 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55529/consoleFull)** for PR 12153 at commit [`739420b`](https://github.com/apache/spark/commit/739420b20b4cd398b4b43f3f5a3db36f4023cdb3). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-208456411 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55529/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-208456006 **[Test build #55529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55529/consoleFull)** for PR 12153 at commit [`739420b`](https://github.com/apache/spark/commit/739420b20b4cd398b4b43f3f5a3db36f4023cdb3). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59057227 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining -PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size) +// Finds out a list of location hosts where lives the first block that contains the +// starting point the file block starts at `offset` with length `size`. +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- This is a really good pointer, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-207030472 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-207030473 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55228/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-207030280 **[Test build #55228 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55228/consoleFull)** for PR 12153 at commit [`3a0e2f7`](https://github.com/apache/spark/commit/3a0e2f71e9ea147083d4d26167fc58ce0b3323cb). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58917033 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining -PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size) +// Finds out a list of location hosts where lives the first block that contains the +// starting point the file block starts at `offset` with length `size`. +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- This is not likely to generate good locality right? Is it reasonable to repurpose this? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L124 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58909434 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining -PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size) +// Finds out a list of location hosts where lives the first block that contains the +// starting point the file block starts at `offset` with length `size`. +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- Each `FilePartition` derives their own preferred location in `FileScanRDD.preferredLocations()` from location hosts of all its `PartitionedFile`s. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org