[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12527#discussion_r60695719
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 ---
@@ -131,4 +134,23 @@ class FileScanRDD(
   }
 
   override protected def getPartitions: Array[RDDPartition] = 
filePartitions.toArray
+
+  override protected def getPreferredLocations(split: RDDPartition): 
Seq[String] = {
+val files = split.asInstanceOf[FilePartition].files
+
+// Computes total number of bytes can be retrieved from each host.
+val hostToNumBytes = mutable.HashMap.empty[String, Long]
+files.foreach { file =>
+  file.locations.filter(_ != "localhost").foreach { host =>
--- End diff --

We should filter them out. For a partition that doesn't have any preferred 
locations, it can be bundled with any other tasks and scheduled to any 
executor. But once it's marked with "localhost", delayed scheduling may be 
triggered because they have different host name as other tasks. Further more, 
"localhost" isn't a valid location for the `DAGScheduler` when deciding which 
executors to run the tasks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/12527


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213253318
  
Merging this into master, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213252687
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213252688
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56639/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213252532
  
**[Test build #56639 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56639/consoleFull)**
 for PR 12527 at commit 
[`2b36e97`](https://github.com/apache/spark/commit/2b36e97aafec3b11132426ffb550f2188c28f787).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213230661
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12527#discussion_r60686408
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 ---
@@ -131,4 +134,23 @@ class FileScanRDD(
   }
 
   override protected def getPartitions: Array[RDDPartition] = 
filePartitions.toArray
+
+  override protected def getPreferredLocations(split: RDDPartition): 
Seq[String] = {
+val files = split.asInstanceOf[FilePartition].files
+
+// Computes total number of bytes can be retrieved from each host.
+val hostToNumBytes = mutable.HashMap.empty[String, Long]
+files.foreach { file =>
+  file.locations.filter(_ != "localhost").foreach { host =>
--- End diff --

ok HadoopRDD and NewHadoopRDD are filtering locationgs set to localhost out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213225300
  
**[Test build #56639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56639/consoleFull)**
 for PR 12527 at commit 
[`2b36e97`](https://github.com/apache/spark/commit/2b36e97aafec3b11132426ffb550f2188c28f787).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213214152
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213214154
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56615/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213213921
  
**[Test build #56615 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56615/consoleFull)**
 for PR 12527 at commit 
[`e0bfa3e`](https://github.com/apache/spark/commit/e0bfa3ed67620ed878e69e341deb102a2defc001).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213167863
  
**[Test build #56615 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56615/consoleFull)**
 for PR 12527 at commit 
[`e0bfa3e`](https://github.com/apache/spark/commit/e0bfa3ed67620ed878e69e341deb102a2defc001).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213167317
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12527#discussion_r60643219
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 ---
@@ -131,4 +134,23 @@ class FileScanRDD(
   }
 
   override protected def getPartitions: Array[RDDPartition] = 
filePartitions.toArray
+
+  override protected def getPreferredLocations(split: RDDPartition): 
Seq[String] = {
+val files = split.asInstanceOf[FilePartition].files
+
+// Computes total number of bytes can be retrieved from each host.
+val hostToNumBytes = mutable.HashMap.empty[String, Long]
+files.foreach { file =>
+  file.locations.filter(_ != "localhost").foreach { host =>
--- End diff --

I am not sure we should filter it out. For that test, localhost is the 
location of files belonging to one of the input tables (the side that is not 
shuffled), right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213072256
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213072264
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56559/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213071922
  
**[Test build #56559 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56559/consoleFull)**
 for PR 12527 at commit 
[`e0bfa3e`](https://github.com/apache/spark/commit/e0bfa3ed67620ed878e69e341deb102a2defc001).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213037505
  
Just want to add a note. For that test case, we have a join that only 
shuffle one side of the input, so we have both preferred locations of original 
input files as well as preferred locations of map output files.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213034425
  
The root cause of the deadlock has been found. Essentially, we should 
prevent "localhost" to be returned as `FileScanRDD` preferred locations. Here's 
a detailed description of the whole processes:

1.  The test case involves a shuffle, which results in a `ShuffleRowRDD`, 
whose preferred locations are determined by the locations of the block manager 
that serves corresponding map output blocks. In a word, in the case of local 
test, the only preferred location string is the IP of the block manager.
1.  In the case of local testing, `FileScanRDD.preferredLocations` always 
returns "localhost".
1.  As a result, task set `ts1` derived from the `ShuffleRowRDD` and task 
set `ts2` derived from the `FileScanRDD` have different locality preference.
1.  After job submission, `DAGScheduler` first schedules `ts1`. While 
trying to schedule `ts2`, delayed scheduling is triggered because `ts1` and 
`ts2` have different preferred locations. By default, `DAGScheduler` waits for 
3s before trying `ts2` again.
1.  3s is long enough for all tasks in `ts1` to finish. However, 
`LocalBackend` doesn't revive offers periodically like other scheduler 
backends. It only revives offer when tasks are submitted, finish, or fail. Thus 
`ts2` never gets an opportunity to be scheduled again, and the submitted job 
never finishes.

The only factor that is not clear for now is how the number of buckets 
(which affects number of submitted tasks) interact with the above process.

The fix for this issue is simple, just filter out all "localhost" in 
`FileScanRDD.preferredLocations()` since "localhost" doesn't make sense as a 
preferred executor location. Actually this is exactly the last step of what 
`NewHadoopRDD.preferredLocations()` does.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-213025638
  
**[Test build #56559 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56559/consoleFull)**
 for PR 12527 at commit 
[`e0bfa3e`](https://github.com/apache/spark/commit/e0bfa3ed67620ed878e69e341deb102a2defc001).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-212696111
  
Yea, agree. Should either prove this is an existing bug (so that it can be 
fixed in another PR), or fix it if it's a bug introduced by this change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12527#discussion_r60456008
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala ---
@@ -301,8 +301,12 @@ class BucketedReadSuite extends QueryTest with 
SQLTestUtils with TestHiveSinglet
   }
 
   test("only shuffle one side when 2 bucketed tables have different bucket 
keys") {
-val bucketSpec1 = Some(BucketSpec(8, Seq("i"), Nil))
-val bucketSpec2 = Some(BucketSpec(8, Seq("j"), Nil))
+// !! ALERT !!
+//
+// Setting `numBuckets` of the following two `BucketSpec` to 8 causes 
a deadlock in DAGScheduler
+// due to unknown reasons. Need investigation.
--- End diff --

Maybe it exposes a bug in somewhere. Let's look at it together.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-212517510
  
@liancheng  We can't have a workaround in test case to hide a bug, should 
fix it before merging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-212465511
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-212465518
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56356/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-212464920
  
**[Test build #56356 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56356/consoleFull)**
 for PR 12527 at commit 
[`0358e33`](https://github.com/apache/spark/commit/0358e33c1a796b7875f932c4d4aee28db505f696).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class FakeBlockLocation(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212462180
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56354/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212462174
  
Build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212461507
  
**[Test build #56354 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56354/consoleFull)**
 for PR 12153 at commit 
[`79beca6`](https://github.com/apache/spark/commit/79beca6ad09df76a39cc1a80eebed378e3b2fffc).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-212425913
  
@yhuai This one can probably avoid the deadlock since I worked around it 
[here][1]. Also check [this comment] for more information.

Not sure whether we should merge this one since I haven't figured out 
whether the deadlock is caused bug(s) in this PR or in `DAGScheduler`.

[1]: 
https://github.com/apache/spark/pull/12527/files#diff-75164dfeac2c1507ae828d2ba5529470R304
[2]: https://github.com/apache/spark/pull/12153#issuecomment-212424785


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12527#issuecomment-212425185
  
**[Test build #56356 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56356/consoleFull)**
 for PR 12527 at commit 
[`0358e33`](https://github.com/apache/spark/commit/0358e33c1a796b7875f932c4d4aee28db505f696).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212424940
  
I'm closing this one for #12527, which is a rebased version of this one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread liancheng
Github user liancheng closed the pull request at:

https://github.com/apache/spark/pull/12153


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212424785
  
All the timeout in the Jenkins builds were due to a deadlock in 
`DAGScheduler`, and can be steadily reproduced locally by running the following 
test case

> BucketedReadSuite.only shuffle one side when 2 bucketed tables have 
different bucket keys.

This test case creates two bucketed tables both with 8 buckets and then 
joins them. Reducing 8 to 5 eliminates the deadlock. But I haven't figured out 
the real reason behind the deadlock. The deadlock also disappears if I remove 
FileScanRDD.preferredLocations(). Maybe that too many tasks are scheduled to 
the same place and exhausted some thread-pool?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/12527

[SPARK-14369][SQL] Locality support for FileScanRDD

(This PR is a rebased version of PR #12153.)

## What changes were proposed in this pull request?

This PR adds preliminary locality support for `FileFormat` data sources by 
overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into 
two parts:

1.  Block location lookup

Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access 
to the underlying `InputFormat` or `InputSplit`, and thus can't rely on 
`InputSplit.getLocations()` to gather locality information. Instead, this PR 
queries block locations using `FileSystem.getBlockLocations()` after listing 
all `FileStatus`es in `HDFSFileCatalog` and convert all `FileStatus`es into 
`LocatedFileStatus`es.

Note that although S3/S3A/S3N file systems don't provide valid locality 
information, their `getLocatedStatus()` implementations don't actually issue 
remote calls either. So there's no need to special case these file systems.

2.  Selecting preferred locations

For each `FilePartition`, we pick up top 3 locations that containing 
the most data to be retrieved. This isn't necessarily the best algorithm out 
there. Further improvements may be brought up in follow-up PRs.

## How was this patch tested?

Tested by overriding default `FileSystem` implementation for `file:///` 
with a mocked one, which returns mocked block locations.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark spark-14369-locality-rebased

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12527.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12527


commit 0358e33c1a796b7875f932c4d4aee28db505f696
Author: Cheng Lian 
Date:   2016-04-20T13:25:51Z

Rebased PR #12153




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212408238
  
**[Test build #56354 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56354/consoleFull)**
 for PR 12153 at commit 
[`79beca6`](https://github.com/apache/spark/commit/79beca6ad09df76a39cc1a80eebed378e3b2fffc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212101265
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56234/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212101095
  
**[Test build #56234 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56234/consoleFull)**
 for PR 12153 at commit 
[`86f1195`](https://github.com/apache/spark/commit/86f1195c3f114aaf3a01f9043302379c09f0d755).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class FakeBlockLocation(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212101258
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-19 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-212043186
  
Seems it is hanging in BucketedReadSuite again... Let's look at it together 
and figure out what's wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211988625
  
**[Test build #56234 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56234/consoleFull)**
 for PR 12153 at commit 
[`86f1195`](https://github.com/apache/spark/commit/86f1195c3f114aaf3a01f9043302379c09f0d755).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r60174480
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
@@ -621,20 +621,40 @@ class HDFSFileCatalog(
 
   def getStatus(path: Path): Array[FileStatus] = 
leafDirToChildrenFiles(path)
 
+  private implicit class LocatedFileStatusIterator(iterator: 
RemoteIterator[LocatedFileStatus])
+extends Iterator[LocatedFileStatus] {
+
+override def hasNext: Boolean = iterator.hasNext
+
+override def next(): LocatedFileStatus = iterator.next()
+  }
+
   private def listLeafFiles(paths: Seq[Path]): 
mutable.LinkedHashSet[FileStatus] = {
 if (paths.length >= 
sqlContext.conf.parallelPartitionDiscoveryThreshold) {
   HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, 
sqlContext.sparkContext)
--- End diff --

Let's also have a test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r60174434
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
@@ -621,20 +621,40 @@ class HDFSFileCatalog(
 
   def getStatus(path: Path): Array[FileStatus] = 
leafDirToChildrenFiles(path)
 
+  private implicit class LocatedFileStatusIterator(iterator: 
RemoteIterator[LocatedFileStatus])
+extends Iterator[LocatedFileStatus] {
+
+override def hasNext: Boolean = iterator.hasNext
+
+override def next(): LocatedFileStatus = iterator.next()
+  }
+
   private def listLeafFiles(paths: Seq[Path]): 
mutable.LinkedHashSet[FileStatus] = {
 if (paths.length >= 
sqlContext.conf.parallelPartitionDiscoveryThreshold) {
   HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, 
sqlContext.sparkContext)
--- End diff --

Seems we also need to update the `listLeafFiles` that is called by 
`listLeafFilesInParallel`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211721311
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211721313
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56158/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211721217
  
**[Test build #56158 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56158/consoleFull)**
 for PR 12153 at commit 
[`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211638453
  
**[Test build #56158 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56158/consoleFull)**
 for PR 12153 at commit 
[`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211637249
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211594412
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56090/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211594408
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211594306
  
**[Test build #56090 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56090/consoleFull)**
 for PR 12153 at commit 
[`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211487877
  
**[Test build #56090 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56090/consoleFull)**
 for PR 12153 at commit 
[`3484559`](https://github.com/apache/spark/commit/34845596b508715762f76374005f7e0e89db4ff0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211446073
  
Seems that Jenkins is down...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-18 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-211423901
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-15 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210521600
  
@liancheng Add more tests?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210192942
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210192944
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55834/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210192829
  
**[Test build #55834 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55834/consoleFull)**
 for PR 12153 at commit 
[`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210093794
  
**[Test build #55834 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55834/consoleFull)**
 for PR 12153 at commit 
[`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210092607
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59766576
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
@@ -624,20 +624,40 @@ class HDFSFileCatalog(
 
   def getStatus(path: Path): Array[FileStatus] = 
leafDirToChildrenFiles(path)
 
+  private implicit class LocatedFileStatusIterator(iterator: 
RemoteIterator[LocatedFileStatus])
--- End diff --

It's used [here][1]. `RemoteIterator` doesn't extend from `Iterator`. This 
implicit conversion helps to convert contents behind a `RemoteIterator` into an 
array.

[1]: 
https://github.com/apache/spark/pull/12153/files#diff-40c347747af9101e7e9fee52fc4120b8R656


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210044469
  
**[Test build #55813 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55813/consoleFull)**
 for PR 12153 at commit 
[`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210044529
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55813/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210044527
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210042698
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55812/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210042693
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-210042553
  
**[Test build #55812 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55812/consoleFull)**
 for PR 12153 at commit 
[`c372a90`](https://github.com/apache/spark/commit/c372a90f2fc03e116f095c7cd1dce534b47104cc).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59733711
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
 ---
@@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest 
with SQLTestUtils with Tes
   df.write.format(dataSourceName).partitionBy("c", "d", 
"e").saveAsTable("t")
 }
   }
+
+  test("Locality support for FileScanRDD") {
+withHadoopConf(
+  "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
+  "fs.file.impl.disable.cache" -> "true"
+) {
+  withTempPath { dir =>
+val path = "file://" + dir.getCanonicalPath
+val df1 = sqlContext.range(4)
+
df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
+
df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
+
+val df2 = sqlContext.read
+  .format(dataSourceName)
+  .option("dataSchema", df1.schema.json)
+  .load(path)
+
+val Some(fileScanRDD) = {
+  def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = {
+rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG)
+  }
+
+  // We have to search for the `FileScanRDD` along the RDD DAG 
since
+  // RDD.preferredLocations doesn't propagate along the DAG to the 
root RDD.
+  allRDDsInDAG(df2.rdd).collectFirst {
+case f: FileScanRDD => f
+  }
+}
+
+val partitions = fileScanRDD.partitions
--- End diff --

If we expect them be packed in the same partition, let's add an `assert`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59733621
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
 ---
@@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest 
with SQLTestUtils with Tes
   df.write.format(dataSourceName).partitionBy("c", "d", 
"e").saveAsTable("t")
 }
   }
+
+  test("Locality support for FileScanRDD") {
--- End diff --

How about we add the following test cases?

* Every partition of the FileScanRDD has a single file.
* There are some large files. So, there are some partitions and every such 
a partition has a part of a file.
* If we are going to pick the top K locations that host the largest amount 
of data, let's also add tests.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59730560
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
@@ -624,20 +624,40 @@ class HDFSFileCatalog(
 
   def getStatus(path: Path): Array[FileStatus] = 
leafDirToChildrenFiles(path)
 
+  private implicit class LocatedFileStatusIterator(iterator: 
RemoteIterator[LocatedFileStatus])
--- End diff --

Why do we need this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-209916891
  
**[Test build #55813 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55813/consoleFull)**
 for PR 12153 at commit 
[`63613ef`](https://github.com/apache/spark/commit/63613ef2d42ec93f0b466116c702316f06fbb7d6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-209915502
  
**[Test build #55812 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55812/consoleFull)**
 for PR 12153 at commit 
[`c372a90`](https://github.com/apache/spark/commit/c372a90f2fc03e116f095c7cd1dce534b47104cc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59707908
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends 
Strategy with Logging {
 
 case _ => Nil
   }
+
+  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = 
file match {
+case f: LocatedFileStatus => f.getBlockLocations
+case f => Array.empty[BlockLocation]
+  }
+
+  // Given locations of all blocks of a single file, `blockLocations`, and 
an `offset` position of
+  // the same file, find out the index of the block that contains this 
`offset`. If no such block
+  // can be found, returns -1.
+  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: 
Long): Int = {
+blockLocations.indexWhere { b =>
+  b.getOffset <= offset && offset < b.getOffset + b.getLength
+}
+  }
+
+  // Given locations of all blocks of a single file, `blockLocations`, and 
an `offset` position of
+  // the same file, find out the block that contains this `offset`, and 
returns location hosts of
+  // that block. If no such block can be found, returns an empty array.
+  private def getBlockHosts(blockLocations: Array[BlockLocation], offset: 
Long): Array[String] = {
--- End diff --

Makes sense, updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-14 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59707891
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 ---
@@ -86,4 +87,9 @@ class FileScanRDD(
   }
 
   override protected def getPartitions: Array[Partition] = 
filePartitions.toArray
+
+  override protected def getPreferredLocations(split: Partition): 
Seq[String] = {
+val files = split.asInstanceOf[FilePartition].files
+if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations
--- End diff --

Updated. @cloud-fan also suggested the same strategy. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59603636
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends 
Strategy with Logging {
 
   val splitFiles = selectedPartitions.flatMap { partition =>
 partition.files.flatMap { file =>
+  val blockLocations = getBlockLocations(file)
   (0L to file.getLen by maxSplitBytes).map { offset =>
 val remaining = file.getLen - offset
 val size = if (remaining > maxSplitBytes) maxSplitBytes 
else remaining
-PartitionedFile(partition.values, 
file.getPath.toUri.toString, offset, size)
+// Finds out a list of location hosts where lives the 
first block that contains the
+// starting point the file block starts at `offset` with 
length `size`.
+val hosts = getBlockHosts(blockLocations, offset)
+PartitionedFile(partition.values, 
file.getPath.toUri.toString, offset, size, hosts)
--- End diff --

We could do that in a separate PR (improve the locality with bounded 
complicity).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59603298
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends 
Strategy with Logging {
 
 case _ => Nil
   }
+
+  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = 
file match {
+case f: LocatedFileStatus => f.getBlockLocations
+case f => Array.empty[BlockLocation]
+  }
+
+  // Given locations of all blocks of a single file, `blockLocations`, and 
an `offset` position of
+  // the same file, find out the index of the block that contains this 
`offset`. If no such block
+  // can be found, returns -1.
+  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: 
Long): Int = {
+blockLocations.indexWhere { b =>
+  b.getOffset <= offset && offset < b.getOffset + b.getLength
+}
+  }
+
+  // Given locations of all blocks of a single file, `blockLocations`, and 
an `offset` position of
+  // the same file, find out the block that contains this `offset`, and 
returns location hosts of
+  // that block. If no such block can be found, returns an empty array.
+  private def getBlockHosts(blockLocations: Array[BlockLocation], offset: 
Long): Array[String] = {
--- End diff --

Should we pick the hosts based on range that the host hold most of data in 
that range?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-209578767
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-209578771
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55713/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-209578718
  
**[Test build #55713 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55713/consoleFull)**
 for PR 12153 at commit 
[`cef40bf`](https://github.com/apache/spark/commit/cef40bfb940a24e43ac01e4c521be6de51f5662d).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59598886
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 ---
@@ -86,4 +87,9 @@ class FileScanRDD(
   }
 
   override protected def getPartitions: Array[Partition] = 
filePartitions.toArray
+
+  override protected def getPreferredLocations(split: Partition): 
Seq[String] = {
+val files = split.asInstanceOf[FilePartition].files
+if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations
--- End diff --

Should we pick the top 3 host that hold most of the data?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59595284
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends 
Strategy with Logging {
 
 case _ => Nil
   }
+
+  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = 
file match {
+case f: LocatedFileStatus => f.getBlockLocations
+case f => Array.empty[BlockLocation]
+  }
+
+  // Given locations of all blocks of a single file, `blockLocations`, and 
an `offset` position of
+  // the same file, find out the index of the block that contains this 
`offset`. If no such block
+  // can be found, returns -1.
+  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: 
Long): Int = {
--- End diff --

This one is a little bit tricky, would like to keep it here for readability.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59595123
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
 ---
@@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest 
with SQLTestUtils with Tes
   df.write.format(dataSourceName).partitionBy("c", "d", 
"e").saveAsTable("t")
 }
   }
+
+  test("Locality support for FileScanRDD") {
+withHadoopConf(
+  "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
+  "fs.file.impl.disable.cache" -> "true"
+) {
+  withTempPath { dir =>
+val path = "file://" + dir.getCanonicalPath
+val df1 = sqlContext.range(4)
+
df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
+
df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
+
+val df2 = sqlContext.read
+  .format(dataSourceName)
+  .option("dataSchema", df1.schema.json)
+  .load(path)
+
+val Some(fileScanRDD) = {
--- End diff --

Yea, this is better, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59594996
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
 ---
@@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest 
with SQLTestUtils with Tes
   df.write.format(dataSourceName).partitionBy("c", "d", 
"e").saveAsTable("t")
 }
   }
+
+  test("Locality support for FileScanRDD") {
+withHadoopConf(
+  "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
+  "fs.file.impl.disable.cache" -> "true"
+) {
+  withTempPath { dir =>
+val path = "file://" + dir.getCanonicalPath
+val df1 = sqlContext.range(4)
+
df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
+
df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
+
+val df2 = sqlContext.read
+  .format(dataSourceName)
+  .option("dataSchema", df1.schema.json)
+  .load(path)
+
+val Some(fileScanRDD) = {
+  def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = {
+rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG)
+  }
+
+  // We have to search for the `FileScanRDD` along the RDD DAG 
since
+  // RDD.preferredLocations doesn't propagate along the DAG to the 
root RDD.
+  allRDDsInDAG(df2.rdd).collectFirst {
+case f: FileScanRDD => f
+  }
+}
+
+val partitions = fileScanRDD.partitions
--- End diff --

Not necessarily, and probably only 1, since both files written are pretty 
small and will be merged into a single partition.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-209528671
  
Overall LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59577970
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends 
Strategy with Logging {
 
 case _ => Nil
   }
+
+  private def getBlockLocations(file: FileStatus): Array[BlockLocation] = 
file match {
+case f: LocatedFileStatus => f.getBlockLocations
+case f => Array.empty[BlockLocation]
+  }
+
+  // Given locations of all blocks of a single file, `blockLocations`, and 
an `offset` position of
+  // the same file, find out the index of the block that contains this 
`offset`. If no such block
+  // can be found, returns -1.
+  private def getBlockIndex(blockLocations: Array[BlockLocation], offset: 
Long): Int = {
--- End diff --

This method is only called once, should we inline it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59577383
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
 ---
@@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest 
with SQLTestUtils with Tes
   df.write.format(dataSourceName).partitionBy("c", "d", 
"e").saveAsTable("t")
 }
   }
+
+  test("Locality support for FileScanRDD") {
+withHadoopConf(
+  "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
+  "fs.file.impl.disable.cache" -> "true"
+) {
+  withTempPath { dir =>
+val path = "file://" + dir.getCanonicalPath
+val df1 = sqlContext.range(4)
+
df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
+
df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
+
+val df2 = sqlContext.read
+  .format(dataSourceName)
+  .option("dataSchema", df1.schema.json)
+  .load(path)
+
+val Some(fileScanRDD) = {
+  def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = {
+rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG)
+  }
+
+  // We have to search for the `FileScanRDD` along the RDD DAG 
since
+  // RDD.preferredLocations doesn't propagate along the DAG to the 
root RDD.
+  allRDDsInDAG(df2.rdd).collectFirst {
+case f: FileScanRDD => f
+  }
+}
+
+val partitions = fileScanRDD.partitions
--- End diff --

do we always return 2 partitions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59576310
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
 ---
@@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest 
with SQLTestUtils with Tes
   df.write.format(dataSourceName).partitionBy("c", "d", 
"e").saveAsTable("t")
 }
   }
+
+  test("Locality support for FileScanRDD") {
+withHadoopConf(
+  "fs.file.impl" -> classOf[LocalityTestFileSystem].getName,
+  "fs.file.impl.disable.cache" -> "true"
+) {
+  withTempPath { dir =>
+val path = "file://" + dir.getCanonicalPath
+val df1 = sqlContext.range(4)
+
df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path)
+
df1.coalesce(1).write.mode("append").format(dataSourceName).save(path)
+
+val df2 = sqlContext.read
+  .format(dataSourceName)
+  .option("dataSchema", df1.schema.json)
+  .load(path)
+
+val Some(fileScanRDD) = {
--- End diff --

how about:
```
val fileScanRDD = df2.queryExecution.executedPlan.collectFirst {
  case scan: DataSourceScan if scan.rdd.isInstanceOf[FileScanRDD] => 
scan.rdd.asInstanceOf[FileScanRDD]
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-209464750
  
**[Test build #55713 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55713/consoleFull)**
 for PR 12153 at commit 
[`cef40bf`](https://github.com/apache/spark/commit/cef40bfb940a24e43ac01e4c521be6de51f5662d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59303595
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 ---
@@ -86,4 +87,20 @@ class FileScanRDD(
   }
 
   override protected def getPartitions: Array[Partition] = 
filePartitions.toArray
+
+  override protected def getPreferredLocations(split: Partition): 
Seq[String] = {
+val files = split.asInstanceOf[FilePartition].files
+val partitionSize = files.map(_.length).sum
--- End diff --

how about:
```
val hostToLength = Map[String, Long]()
for (file <- files) {
  for (host <- file.hosts) {
if (hostToLength.contains(host)) {
  hostToLength(host) += file.length
} else {
  hostToLength(host) = file.length
}
  }
  hostToLength.toSeq.sortBy(_._2).take(3).map(_._1)
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-208456408
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-208456399
  
**[Test build #55529 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55529/consoleFull)**
 for PR 12153 at commit 
[`739420b`](https://github.com/apache/spark/commit/739420b20b4cd398b4b43f3f5a3db36f4023cdb3).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-208456411
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55529/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-208456006
  
**[Test build #55529 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55529/consoleFull)**
 for PR 12153 at commit 
[`739420b`](https://github.com/apache/spark/commit/739420b20b4cd398b4b43f3f5a3db36f4023cdb3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-08 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r59057227
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends 
Strategy with Logging {
 
   val splitFiles = selectedPartitions.flatMap { partition =>
 partition.files.flatMap { file =>
+  val blockLocations = getBlockLocations(file)
   (0L to file.getLen by maxSplitBytes).map { offset =>
 val remaining = file.getLen - offset
 val size = if (remaining > maxSplitBytes) maxSplitBytes 
else remaining
-PartitionedFile(partition.values, 
file.getPath.toUri.toString, offset, size)
+// Finds out a list of location hosts where lives the 
first block that contains the
+// starting point the file block starts at `offset` with 
length `size`.
+val hosts = getBlockHosts(blockLocations, offset)
+PartitionedFile(partition.values, 
file.getPath.toUri.toString, offset, size, hosts)
--- End diff --

This is a really good pointer, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-207030472
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-207030473
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55228/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12153#issuecomment-207030280
  
**[Test build #55228 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55228/consoleFull)**
 for PR 12153 at commit 
[`3a0e2f7`](https://github.com/apache/spark/commit/3a0e2f71e9ea147083d4d26167fc58ce0b3323cb).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-07 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r58917033
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends 
Strategy with Logging {
 
   val splitFiles = selectedPartitions.flatMap { partition =>
 partition.files.flatMap { file =>
+  val blockLocations = getBlockLocations(file)
   (0L to file.getLen by maxSplitBytes).map { offset =>
 val remaining = file.getLen - offset
 val size = if (remaining > maxSplitBytes) maxSplitBytes 
else remaining
-PartitionedFile(partition.values, 
file.getPath.toUri.toString, offset, size)
+// Finds out a list of location hosts where lives the 
first block that contains the
+// starting point the file block starts at `offset` with 
length `size`.
+val hosts = getBlockHosts(blockLocations, offset)
+PartitionedFile(partition.values, 
file.getPath.toUri.toString, offset, size, hosts)
--- End diff --

This is not likely to generate good locality right?

Is it reasonable to repurpose this?

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L124


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...

2016-04-07 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/12153#discussion_r58909434
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends 
Strategy with Logging {
 
   val splitFiles = selectedPartitions.flatMap { partition =>
 partition.files.flatMap { file =>
+  val blockLocations = getBlockLocations(file)
   (0L to file.getLen by maxSplitBytes).map { offset =>
 val remaining = file.getLen - offset
 val size = if (remaining > maxSplitBytes) maxSplitBytes 
else remaining
-PartitionedFile(partition.values, 
file.getPath.toUri.toString, offset, size)
+// Finds out a list of location hosts where lives the 
first block that contains the
+// starting point the file block starts at `offset` with 
length `size`.
+val hosts = getBlockHosts(blockLocations, offset)
+PartitionedFile(partition.values, 
file.getPath.toUri.toString, offset, size, hosts)
--- End diff --

Each `FilePartition` derives their own preferred location in 
`FileScanRDD.preferredLocations()` from location hosts of all its 
`PartitionedFile`s.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   >