[GitHub] spark issue #19130: [SPARK-21917][CORE][YARN] Supporting adding http(s) reso...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19130
  
**[Test build #81665 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81665/testReport)**
 for PR 19130 at commit 
[`4bbc09d`](https://github.com/apache/spark/commit/4bbc09d68c21496d97be3e2d9f781e7ca0bbf7e7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are ...

2017-09-12 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19199#discussion_r138268612
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -109,6 +109,20 @@ class CSVFileFormat extends TextBasedFileFormat with 
DataSourceRegister {
   }
 }
 
+if (requiredSchema.length == 1 &&
+  requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) 
{
+  throw new AnalysisException(
+"Since Spark 2.3, the queries from raw JSON/CSV files are 
disallowed when the\n" +
+  "referenced columns only include the internal corrupt record 
column\n" +
+  s"(named ${parsedOptions.columnNameOfCorruptRecord} by default). 
For example:\n" +
+  
"spark.read.schema(schema).json(file).filter($\"_corrupt_record\".isNotNull).count()\n"
 +
+  "and 
spark.read.schema(schema).json(file).select(\"_corrupt_record\").show().\n" +
--- End diff --

We better use csv example here instead of json.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19130: [SPARK-21917][CORE][YARN] Supporting adding http(s) reso...

2017-09-12 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19130
  
Jenkins, retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16422
  
**[Test build #81664 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81664/testReport)**
 for PR 16422 at commit 
[`0d49ee9`](https://github.com/apache/spark/commit/0d49ee91508c908daef672a04768c15a9e5c5dba).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19198: [MINOR][DOC] Add missing call of `update()` in examples ...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19198
  
**[Test build #81663 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81663/testReport)**
 for PR 19198 at commit 
[`6f3859c`](https://github.com/apache/spark/commit/6f3859c38392c9d1e5b5be9883610ecb26513736).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16422
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19198: [MINOR][DOC] Add missing call of `update()` in examples ...

2017-09-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/19198
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16422
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81655/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19130: [SPARK-21917][CORE][YARN] Supporting adding http(s) reso...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19130
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16422
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19198: [MINOR][DOC] Add missing call of `update()` in examples ...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19198
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19185
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19130: [SPARK-21917][CORE][YARN] Supporting adding http(s) reso...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19130
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81658/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19185
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81660/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19198: [MINOR][DOC] Add missing call of `update()` in examples ...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19198
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81659/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19118
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81662/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19130: [SPARK-21917][CORE][YARN] Supporting adding http(s) reso...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19130
  
**[Test build #81658 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81658/testReport)**
 for PR 19130 at commit 
[`4bbc09d`](https://github.com/apache/spark/commit/4bbc09d68c21496d97be3e2d9f781e7ca0bbf7e7).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19185
  
**[Test build #81660 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81660/testReport)**
 for PR 19185 at commit 
[`eb8f6b4`](https://github.com/apache/spark/commit/eb8f6b431982d6f1f0118965391560f94812ab53).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19198: [MINOR][DOC] Add missing call of `update()` in examples ...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19198
  
**[Test build #81659 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81659/testReport)**
 for PR 19198 at commit 
[`6f3859c`](https://github.com/apache/spark/commit/6f3859c38392c9d1e5b5be9883610ecb26513736).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16422
  
**[Test build #81655 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81655/testReport)**
 for PR 16422 at commit 
[`0d49ee9`](https://github.com/apache/spark/commit/0d49ee91508c908daef672a04768c15a9e5c5dba).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19118
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19118
  
**[Test build #81662 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81662/testReport)**
 for PR 19118 at commit 
[`2c4f2ca`](https://github.com/apache/spark/commit/2c4f2ca7f92916114d090208091ba718da5621c6).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19199
  
**[Test build #81661 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81661/testReport)**
 for PR 19199 at commit 
[`e703fc8`](https://github.com/apache/spark/commit/e703fc8f33d1fde90d790057481f1d23f466f378).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19118
  
**[Test build #81662 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81662/testReport)**
 for PR 19118 at commit 
[`2c4f2ca`](https://github.com/apache/spark/commit/2c4f2ca7f92916114d090208091ba718da5621c6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...

2017-09-12 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19199
  
cc @gatorsmile, @HyukjinKwon and @viirya. Could you guys help to review 
this? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...

2017-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19199
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19199
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are ...

2017-09-12 Thread jmchung
GitHub user jmchung opened a pull request:

https://github.com/apache/spark/pull/19199

[SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when 
creating a dataframe from a file

## What changes were proposed in this pull request?

When the `requiredSchema` only contains `_corrupt_record`, the derived 
`actualSchema` is empty and the `_corrupt_record` are all null for all rows. 
This PR captures above situation and raise an exception with a reasonable 
workaround messag so that users can know what happened and how to fix the query.

## How was this patch tested?

Added unit test in `CSVSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-21610-FOLLOWUP

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19199.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19199


commit e703fc8f33d1fde90d790057481f1d23f466f378
Author: Jen-Ming Chung 
Date:   2017-09-12T06:48:33Z

follow-up PR for CSV




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...

2017-09-12 Thread awarrior
Github user awarrior commented on the issue:

https://github.com/apache/spark/pull/19118
  
@jiangxb1987 well, I passed that part above but met other initialization 
chances before runJob. They are in the write function of SparkHadoopWriter.

> 
// Assert the output format/key/value class is set in JobConf.
config.assertConf(jobContext, rdd.conf) <= chance

val committer = config.createCommitter(stageId)
committer.setupJob(jobContext) <= chance

// Try to write all RDD partitions as a Hadoop OutputFormat.
try {
  val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: 
Iterator[(K, V)]) => {
executeTask(
  context = context,
  config = config,
  jobTrackerId = jobTrackerId,
  sparkStageId = context.stageId,
  sparkPartitionId = context.partitionId,
  sparkAttemptNumber = context.attemptNumber,
  committer = committer,
  iterator = iter)
  })

One trace list:

> java.lang.Thread.State: RUNNABLE
  at org.apache.hadoop.fs.FileSystem.getStatistics(FileSystem.java:3270)
  - locked <0x126a> (a java.lang.Class)
  at org.apache.hadoop.fs.FileSystem.initialize(FileSystem.java:202)
  at 
org.apache.hadoop.fs.RawLocalFileSystem.initialize(RawLocalFileSystem.java:92)
  at 
org.apache.hadoop.fs.LocalFileSystem.initialize(LocalFileSystem.java:47)
  at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
  at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
  at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.(FileOutputCommitter.java:91)
  at 
org.apache.hadoop.mapred.FileOutputCommitter.getWrapped(FileOutputCommitter.java:65)
  at 
org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:131)
  at 
org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:233)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:125)
  at 
org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:74)




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19118: [SPARK-21882][CORE] OutputMetrics doesn't count w...

2017-09-12 Thread awarrior
Github user awarrior commented on a diff in the pull request:

https://github.com/apache/spark/pull/19118#discussion_r138263099
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -112,11 +112,12 @@ object SparkHadoopWriter extends Logging {
   jobTrackerId, sparkStageId, sparkPartitionId, sparkAttemptNumber)
 committer.setupTask(taskContext)
 
-val (outputMetrics, callback) = initHadoopOutputMetrics(context)
-
 // Initiate the writer.
 config.initWriter(taskContext, sparkPartitionId)
 var recordsWritten = 0L
+
+// Initialize callback function after the writer.
--- End diff --

ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19141: [SPARK-21384] [YARN] Spark + YARN fails with Loca...

2017-09-12 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19141#discussion_r138262309
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala 
---
@@ -565,7 +565,6 @@ private[spark] class Client(
   distribute(jarsArchive.toURI.getPath,
 resType = LocalResourceType.ARCHIVE,
 destName = Some(LOCALIZED_LIB_DIR))
-  jarsArchive.delete()
--- End diff --

What if your scenario and SPARK-20741's scenario are both encountered? 
Looks like your approach above cannot be worked.

I'm wondering if we can copy or move this __spark_libs__.zip temp file to 
another non-temp file and add that file to the dist cache. That non-temp file 
will not be deleted and can be overwritten during another launching, so we will 
always have only one copy.

Besides, I think we have several workarounds to handle this issue like 
spark.yarn.jars or spark.yarn.archive, so looks like this corner case is not so 
necessary to fix (just my thinking, normally people will not use local FS in a 
real cluster).






---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138254795
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.Serializable;
+
+/**
+ * A read task returned by a data source reader and is responsible to 
create the data reader.
+ * The relationship between `ReadTask` and `DataReader` is similar to 
`Iterable` and `Iterator`.
+ *
+ * Note that, the read task will be serialized and sent to executors, then 
the data reader will be
+ * created on executors and do the actual reading.
+ */
+public interface ReadTask extends Serializable {
+  /**
+   * The preferred locations for this read task to run faster, but Spark 
can't guarantee that this
--- End diff --

`locations for this read task to run faster` -> `locations where this read 
task can run faster`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138254289
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
 ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source reader that can mix in various query optimization 
interfaces and implement these
+ * optimizations. The actual scan logic should be delegated to `ReadTask`s 
that are returned by
+ * this data source reader.
+ *
+ * There are mainly 3 kinds of query optimizations:
+ *   1. push operators downward to the data source, e.g., column pruning, 
filter push down, etc.
+ *   2. propagate information upward to Spark, e.g., report statistics, 
report ordering, etc.
+ *   3. special scans like columnar scan, unsafe row scan, etc. Note that 
a data source reader can
+ *  at most implement one special scan.
+ *
+ * Spark first applies all operator push down optimizations which this 
data source supports. Then
+ * Spark collects information this data source provides for further 
optimizations. Finally Spark
+ * issues the scan request and does the actual data reading.
+ */
+public interface DataSourceV2Reader {
+
+  /**
+   * Returns the actual schema of this data source reader, which may be 
different from the physical
+   * schema of the underlying storage, as column pruning or other 
optimizations may happen.
+   */
+  StructType readSchema();
+
+  /**
+   * Returns a list of read tasks, each task is responsible for outputting 
data for one RDD
+   * partition, which means the number of tasks returned here is same as 
the number of RDD
--- End diff --

`, which means` -> `That means`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138253471
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
 ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source reader that can mix in various query optimization 
interfaces and implement these
+ * optimizations. The actual scan logic should be delegated to `ReadTask`s 
that are returned by
+ * this data source reader.
+ *
+ * There are mainly 3 kinds of query optimizations:
+ *   1. push operators downward to the data source, e.g., column pruning, 
filter push down, etc.
+ *   2. propagate information upward to Spark, e.g., report statistics, 
report ordering, etc.
+ *   3. special scans like columnar scan, unsafe row scan, etc. Note that 
a data source reader can
+ *  at most implement one special scan.
--- End diff --

`at most implement one` -> ` implement at most one`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138252631
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReader.java 
---
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.Closeable;
+
+/**
+ * A data reader returned by a read task and is responsible for outputting 
data for an RDD
--- End diff --

Nit: `an` -> `a`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138255522
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/ColumnPruningSupport.java
 ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.downward;
+
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A mix-in interface for `DataSourceV2Reader`. Users can implement this 
interface to only read
+ * required columns/nested fields during scan.
+ */
+public interface ColumnPruningSupport {
+
+  /**
+   * Apply column pruning w.r.t. the given requiredSchema.
+   *
+   * Implementation should try its best to prune unnecessary 
columns/nested fields, but it's also
--- End diff --

`the unnecessary `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138254894
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.Serializable;
+
+/**
+ * A read task returned by a data source reader and is responsible to 
create the data reader.
+ * The relationship between `ReadTask` and `DataReader` is similar to 
`Iterable` and `Iterator`.
+ *
+ * Note that, the read task will be serialized and sent to executors, then 
the data reader will be
+ * created on executors and do the actual reading.
+ */
+public interface ReadTask extends Serializable {
+  /**
+   * The preferred locations for this read task to run faster, but Spark 
can't guarantee that this
--- End diff --

`can't guarantee` -> `does not guarantee`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138254192
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
 ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source reader that can mix in various query optimization 
interfaces and implement these
+ * optimizations. The actual scan logic should be delegated to `ReadTask`s 
that are returned by
+ * this data source reader.
+ *
+ * There are mainly 3 kinds of query optimizations:
+ *   1. push operators downward to the data source, e.g., column pruning, 
filter push down, etc.
+ *   2. propagate information upward to Spark, e.g., report statistics, 
report ordering, etc.
+ *   3. special scans like columnar scan, unsafe row scan, etc. Note that 
a data source reader can
+ *  at most implement one special scan.
+ *
+ * Spark first applies all operator push down optimizations which this 
data source supports. Then
+ * Spark collects information this data source provides for further 
optimizations. Finally Spark
+ * issues the scan request and does the actual data reading.
+ */
+public interface DataSourceV2Reader {
+
+  /**
+   * Returns the actual schema of this data source reader, which may be 
different from the physical
+   * schema of the underlying storage, as column pruning or other 
optimizations may happen.
+   */
+  StructType readSchema();
+
+  /**
+   * Returns a list of read tasks, each task is responsible for outputting 
data for one RDD
--- End diff --

`, each` -> `. Each`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138254426
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
 ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source reader that can mix in various query optimization 
interfaces and implement these
+ * optimizations. The actual scan logic should be delegated to `ReadTask`s 
that are returned by
+ * this data source reader.
+ *
+ * There are mainly 3 kinds of query optimizations:
+ *   1. push operators downward to the data source, e.g., column pruning, 
filter push down, etc.
+ *   2. propagate information upward to Spark, e.g., report statistics, 
report ordering, etc.
+ *   3. special scans like columnar scan, unsafe row scan, etc. Note that 
a data source reader can
+ *  at most implement one special scan.
+ *
+ * Spark first applies all operator push down optimizations which this 
data source supports. Then
+ * Spark collects information this data source provides for further 
optimizations. Finally Spark
+ * issues the scan request and does the actual data reading.
+ */
+public interface DataSourceV2Reader {
+
+  /**
+   * Returns the actual schema of this data source reader, which may be 
different from the physical
+   * schema of the underlying storage, as column pruning or other 
optimizations may happen.
+   */
+  StructType readSchema();
+
+  /**
+   * Returns a list of read tasks, each task is responsible for outputting 
data for one RDD
+   * partition, which means the number of tasks returned here is same as 
the number of RDD
+   * partitions this scan outputs.
+   *
+   * Note that, this may not be a full scan if the data source reader 
mixes in other optimization
+   * interfaces like column pruning, filter push down, etc. These 
optimizations are applied before
--- End diff --

`push down` -> `push-down`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138255191
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/CatalystFilterPushDownSupport.java
 ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.downward;
+
+import org.apache.spark.annotation.Experimental;
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.catalyst.expressions.Expression;
+
+/**
+ * A mix-in interface for `DataSourceV2Reader`. Users can implement this 
interface to push down
+ * arbitrary expressions as predicates to the data source.
+ */
+@Experimental
+@InterfaceStability.Unstable
+public interface CatalystFilterPushDownSupport {
+
+  /**
+   * Push down filters, returns unsupported filters.
--- End diff --

`Pushes down filters, and returns unsupported filters.`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138253590
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
 ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.util.List;
+
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source reader that can mix in various query optimization 
interfaces and implement these
+ * optimizations. The actual scan logic should be delegated to `ReadTask`s 
that are returned by
+ * this data source reader.
+ *
+ * There are mainly 3 kinds of query optimizations:
+ *   1. push operators downward to the data source, e.g., column pruning, 
filter push down, etc.
+ *   2. propagate information upward to Spark, e.g., report statistics, 
report ordering, etc.
+ *   3. special scans like columnar scan, unsafe row scan, etc. Note that 
a data source reader can
+ *  at most implement one special scan.
+ *
+ * Spark first applies all operator push down optimizations which this 
data source supports. Then
--- End diff --

`push down` -> `push-down`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138258962
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/upward/StatisticsSupport.java
 ---
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.upward;
+
+/**
+ * A mix in interface for `DataSourceV2Reader`. Users can implement this 
interface to report
+ * statistics to Spark.
+ */
+public interface StatisticsSupport {
+  Statistics getStatistics();
--- End diff --

Will the returned stats be adjusted by the data sources based on the 
operator push-down?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138255810
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/FilterPushDownSupport.java
 ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.downward;
+
+import org.apache.spark.sql.sources.Filter;
+
+/**
+ * A mix-in interface for `DataSourceV2Reader`. Users can implement this 
interface to push down
+ * filters to the data source and reduce the size of the data to be read.
+ */
+public interface FilterPushDownSupport {
+
+  /**
+   * Push down filters, returns unsupported filters.
--- End diff --

`Pushes down filters, and returns unsupported filters.`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138255388
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/ColumnPruningSupport.java
 ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.downward;
+
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A mix-in interface for `DataSourceV2Reader`. Users can implement this 
interface to only read
+ * required columns/nested fields during scan.
--- End diff --

-> `the required`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138255263
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/CatalystFilterPushDownSupport.java
 ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.downward;
+
+import org.apache.spark.annotation.Experimental;
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.catalyst.expressions.Expression;
+
+/**
+ * A mix-in interface for `DataSourceV2Reader`. Users can implement this 
interface to push down
+ * arbitrary expressions as predicates to the data source.
--- End diff --

`Note that, this is an experimental and unstable interface`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2

2017-09-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r138254904
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.Serializable;
+
+/**
+ * A read task returned by a data source reader and is responsible to 
create the data reader.
+ * The relationship between `ReadTask` and `DataReader` is similar to 
`Iterable` and `Iterator`.
+ *
+ * Note that, the read task will be serialized and sent to executors, then 
the data reader will be
+ * created on executors and do the actual reading.
+ */
+public interface ReadTask extends Serializable {
+  /**
+   * The preferred locations for this read task to run faster, but Spark 
can't guarantee that this
+   * task will always run on these locations. Implementations should make 
sure that it can
--- End diff --

`Implementations ` -> `The Implementation`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81651/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81651 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81651/testReport)**
 for PR 19186 at commit 
[`9e53579`](https://github.com/apache/spark/commit/9e53579b4e8e69761f5a6c89cc60ab179ff78ea6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19196: [SPARK-21977] SinglePartition optimizations break certai...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19196
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81657/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19196: [SPARK-21977] SinglePartition optimizations break certai...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19196
  
**[Test build #81657 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81657/testReport)**
 for PR 19196 at commit 
[`12cf02a`](https://github.com/apache/spark/commit/12cf02a10ff7219f1ed405c37c2ac87c65a6c798).
 * This patch **fails to generate documentation**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19196: [SPARK-21977] SinglePartition optimizations break certai...

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19196
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19185
  
**[Test build #81660 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81660/testReport)**
 for PR 19185 at commit 
[`eb8f6b4`](https://github.com/apache/spark/commit/eb8f6b431982d6f1f0118965391560f94812ab53).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19198: [MINOR][DOC] Add missing call of `update()` in examples ...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19198
  
**[Test build #81659 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81659/testReport)**
 for PR 19198 at commit 
[`6f3859c`](https://github.com/apache/spark/commit/6f3859c38392c9d1e5b5be9883610ecb26513736).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19197: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19197
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81654/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19197: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19197
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19197: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19197
  
**[Test build #81654 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81654/testReport)**
 for PR 19197 at commit 
[`b485614`](https://github.com/apache/spark/commit/b4856147e04c3d57f2bfc70c70e3f136f46fa873).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81650/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81650 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81650/testReport)**
 for PR 19186 at commit 
[`e112b42`](https://github.com/apache/spark/commit/e112b42a2df231f6b200bcb0cd3759cc143c8c80).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19130: [SPARK-21917][CORE][YARN] Supporting adding http(s) reso...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19130
  
**[Test build #81658 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81658/testReport)**
 for PR 19130 at commit 
[`4bbc09d`](https://github.com/apache/spark/commit/4bbc09d68c21496d97be3e2d9f781e7ca0bbf7e7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19198: [MINOR][DOC] Add missing call of `update()` in ex...

2017-09-12 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/19198

[MINOR][DOC] Add missing call of `update()` in examples of 
PeriodicGraphCheckpointer & PeriodicRDDCheckpointer

## What changes were proposed in this pull request?
forgot to call `update()` with `graph1` & `rdd1` in examples for 
`PeriodicGraphCheckpointer` & `PeriodicRDDCheckpoin`
## How was this patch tested?
existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark fix_doc_checkpointer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19198.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19198


commit 6f3859c38392c9d1e5b5be9883610ecb26513736
Author: Zheng RuiFeng 
Date:   2017-09-12T05:59:25Z

create pr




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19130: [SPARK-21917][CORE][YARN] Supporting adding http(s) reso...

2017-09-12 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19130
  
@tgravescs , thanks for your comments, can you review again, if it is what 
you expected.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #9168: [SPARK-11182] HDFS Delegation Token will be expired when ...

2017-09-12 Thread jackiehff
Github user jackiehff commented on the issue:

https://github.com/apache/spark/pull/9168
  
@marsishandsome, the patch file in jira HDFS-9276 has been introduced into 
hadoop-2.6.0-cdh5.7.3 which is our hadoop cluster version, but I still got 
token can't be found in cache error when I running spark streaming job, so my 
question is whether spark source code also need to be modified according to 
your pull request? By the way, I also used the configuration " --conf 
spark.hadoop.fs.hdfs.impl.disable.cache=true", but it didn't work


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19196: [SPARK-21977] SinglePartition optimizations break certai...

2017-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19196
  
**[Test build #81657 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81657/testReport)**
 for PR 19196 at commit 
[`12cf02a`](https://github.com/apache/spark/commit/12cf02a10ff7219f1ed405c37c2ac87c65a6c798).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



<    1   2   3   4   5