[GitHub] spark pull request #20437: [SPARK-23270][Streaming][WEB-UI]FileInputDStream ...

2018-01-30 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20437#discussion_r164973745
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
@@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
 val metadata = Map(
   "files" -> newFiles.toList,
   StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
-val inputInfo = StreamInputInfo(id, 0, metadata)
+val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata)
--- End diff --

This is not a small overhead. The change will read/scan all the new files, which is a big overhead for a streaming application (the data is unnecessarily read twice).


---




[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164973111
  
--- Diff: python/pyspark/sql/window.py ---
@@ -212,16 +218,20 @@ def rangeBetween(self, start, end):
 values directly.
 
 :param start: boundary start, inclusive.
-  The frame is unbounded if this is ``Window.unboundedPreceding``, or
+  The frame is unbounded if this is ``Window.unboundedPreceding``,
+  ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or
   any value less than or equal to max(-sys.maxsize, -9223372036854775808).
 :param end: boundary end, inclusive.
-The frame is unbounded if this is ``Window.unboundedFollowing``, or
+The frame is unbounded if this is ``Window.unboundedFollowing``,
+``org.apache.spark.sql.catalyst.expressions.UnboundedFollowing``, or
--- End diff --

Wait .. why do we expose the `org.apache.spark.sql.catalyst` path in the Python docs? In addition, this package is meant to be internal, if I haven't missed something?


---




[GitHub] spark pull request #20437: [SPARK-23270][Streaming][WEB-UI]FileInputDStream ...

2018-01-30 Thread guoxiaolongzte
Github user guoxiaolongzte commented on a diff in the pull request:

https://github.com/apache/spark/pull/20437#discussion_r164973156
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
@@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
 val metadata = Map(
   "files" -> newFiles.toList,
   StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
-val inputInfo = StreamInputInfo(id, 0, metadata)
+val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata)
--- End diff --

Because of this little overhead, 'Records' is not recorded? This is an obvious bug.


---




[GitHub] spark issue #20450: [SPARK-23280][SQL] add map type support to ColumnVector

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20450
  
**[Test build #86866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86866/testReport)** for PR 20450 at commit [`8e66fb5`](https://github.com/apache/spark/commit/8e66fb5077c68ff1378ec2a9e4238f9d73af5fb4).


---




[GitHub] spark issue #20450: [SPARK-23280][SQL] add map type support to ColumnVector

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20450
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/421/
Test PASSed.


---




[GitHub] spark issue #20450: [SPARK-23280][SQL] add map type support to ColumnVector

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20450
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20450: [SPARK-23280][SQL] add map type support to ColumnVector

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20450
  
cc @hvanhovell @viirya @ueshin @kiszk @gatorsmile @dongjoon-hyun 


---




[GitHub] spark pull request #20450: [SPARK-23280][SQL] add map type support to Column...

2018-01-30 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20450

[SPARK-23280][SQL] add map type support to ColumnVector

## What changes were proposed in this pull request?

Fill the last missing piece of `ColumnVector`: the map type support.

The idea is similar to the array type support. A map is basically two arrays, one for keys and one for values. We ask the implementations to provide a key array, a value array, and an offset and length to specify the range of this map within the key/value arrays.

In `WritableColumnVector`, we put the key array in the first child vector, the value array in the second child vector, and the offsets and lengths in the current vector, which is very similar to how the array type is implemented here.
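
For illustration, here is a minimal Scala sketch of that layout. Everything below (`MapColumnSketch`, `MapColumn`, `getMap`, and the field names) is invented for the example; it is not the actual `ColumnVector` API:

```scala
// Hypothetical stand-in for the layout described above: keys live in one
// flattened child array, values in another, and each row stores an
// (offset, length) pair that selects its slice of both children.
object MapColumnSketch {
  final case class MapColumn(
      keys: Array[Int],       // "first child vector": all keys, back to back
      values: Array[String],  // "second child vector": all values, back to back
      offsets: Array[Int],    // per-row start index into keys/values
      lengths: Array[Int]) {  // per-row number of entries

    def getMap(rowId: Int): Map[Int, String] = {
      val start = offsets(rowId)
      (start until start + lengths(rowId)).map(i => keys(i) -> values(i)).toMap
    }
  }

  def main(args: Array[String]): Unit = {
    // Row 0 holds {1 -> "a", 2 -> "b"}, row 1 holds {3 -> "c"}.
    val col = MapColumn(Array(1, 2, 3), Array("a", "b", "c"), Array(0, 2), Array(2, 1))
    println(col.getMap(0)) // Map(1 -> a, 2 -> b)
    println(col.getMap(1)) // Map(3 -> c)
  }
}
```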

## How was this patch tested?

a new test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark map

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20450.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20450


commit 8e66fb5077c68ff1378ec2a9e4238f9d73af5fb4
Author: Wenchen Fan 
Date:   2018-01-30T10:52:12Z

add map type support to ColumnVector




---




[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164972094
  
--- Diff: python/pyspark/sql/window.py ---
@@ -212,16 +218,20 @@ def rangeBetween(self, start, end):
 values directly.
 
 :param start: boundary start, inclusive.
-  The frame is unbounded if this is ``Window.unboundedPreceding``, or
+  The frame is unbounded if this is ``Window.unboundedPreceding``,
+  ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or
   any value less than or equal to max(-sys.maxsize, -9223372036854775808).
 :param end: boundary end, inclusive.
-The frame is unbounded if this is ``Window.unboundedFollowing``, or
+The frame is unbounded if this is ``Window.unboundedFollowing``,
+``org.apache.spark.sql.catalyst.expressions.UnboundedFollowing``, or
 any value greater than or equal to min(sys.maxsize, 9223372036854775807).
+
--- End diff --

Can we have a doctest resembling this: https://github.com/jiangxb1987/spark/blob/cec519b8cfbf1ed2a3107056ef5281a5be75ec54/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala#L214-L240?
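
For reference, a small runnable Scala sketch of the kind of example those linked lines show, using the stable `Long`-based frame boundaries; the DataFrame and column names here are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object RangeBetweenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rangeBetween").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1L, 10L), ("a", 2L, 20L), ("b", 1L, 5L)).toDF("category", "id", "value")

    // Frame from the start of the partition up to (and including) the current row.
    val frame = Window
      .partitionBy("category")
      .orderBy("id")
      .rangeBetween(Window.unboundedPreceding, Window.currentRow)

    df.withColumn("running_sum", sum("value").over(frame)).show()
    spark.stop()
  }
}
```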



---




[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164971807
  
--- Diff: python/pyspark/sql/window.py ---
@@ -124,16 +126,20 @@ def rangeBetween(start, end):
 values directly.
 
 :param start: boundary start, inclusive.
-  The frame is unbounded if this is ``Window.unboundedPreceding``, or
+  The frame is unbounded if this is ``Window.unboundedPreceding``,
+  ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or
   any value less than or equal to max(-sys.maxsize, -9223372036854775808).
 :param end: boundary end, inclusive.
-The frame is unbounded if this is ``Window.unboundedFollowing``, or
+The frame is unbounded if this is ``Window.unboundedFollowing``,
+``org.apache.spark.sql.catalyst.expressions.UnboundedFollowing``, or
 any value greater than or equal to min(sys.maxsize, 9223372036854775807).
+any value greater than or equal to 9223372036854775807.
 """
-if start <= Window._PRECEDING_THRESHOLD:
-    start = Window.unboundedPreceding
-if end >= Window._FOLLOWING_THRESHOLD:
-    end = Window.unboundedFollowing
+if isinstance(start, (int, long)) and isinstance(end, (int, long)):
+    if start <= Window._PRECEDING_THRESHOLD:
+        start = Window.unboundedPreceding
+    if end >= Window._FOLLOWING_THRESHOLD:
+        end = Window.unboundedFollowing
--- End diff --

Shall we add logic like:

```
elif isinstance(start, Column) and isinstance(end, Column):
    start = start._jc
    end = end._jc
```

too?


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164971509
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java ---
@@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
 if (columns[ordinal].isNullAt(rowId)) return null;
--- End diff --

Let me try to prepare a PR tonight.


---




[GitHub] spark issue #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17886
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17886
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/420/
Test PASSed.


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20438


---




[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164968965
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -809,6 +809,45 @@ def ntile(n):
 return Column(sc._jvm.functions.ntile(int(n)))
 
 
+@since(2.3)
+def unboundedPreceding():
+"""
+Window function: returns the special frame boundary that represents 
the first row
+in the window partition.
+>>> df = spark.createDataFrame([(5,)])
+>>> df.select(unboundedPreceding()).show
+
--- End diff --

I think we should have a working example in the doctests here.


---




[GitHub] spark issue #20438: [SPARK-23272][SQL] add calendar interval type support to...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20438
  
thanks, merging to master/2.3!


---




[GitHub] spark pull request #20437: [SPARK-23270][Streaming][WEB-UI]FileInputDStream ...

2018-01-30 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20437#discussion_r164968292
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
@@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
 val metadata = Map(
   "files" -> newFiles.toList,
   StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
-val inputInfo = StreamInputInfo(id, 0, metadata)
+val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata)
--- End diff --

This will kick off a new Spark job to read the files and count the records, which will bring in obvious overhead, whereas `count` in `DirectKafkaInputDStream` only calculates offsets.
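
To illustrate the asymmetry, a sketch with a simplified stand-in for the Kafka connector's `OffsetRange` (not the actual connector code): the direct Kafka stream's per-batch record count is pure offset arithmetic, while `rdds.map(_.count).sum` on a file stream must launch a job that re-reads every new file.

```scala
// Simplified stand-in for org.apache.spark.streaming.kafka010.OffsetRange:
// the batch's record count comes from metadata alone; no data is read.
object OffsetCountSketch {
  final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  def main(args: Array[String]): Unit = {
    val ranges = Seq(OffsetRange("events", 0, 100L, 250L), OffsetRange("events", 1, 80L, 180L))
    println(ranges.map(_.count).sum) // 250, computed without touching Kafka
    // By contrast, calling count() on a FileInputDStream batch's RDDs schedules
    // a Spark job that scans every newly discovered file just to get this number.
  }
}
```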


---




[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164966631
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,16 +17,57 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.StructType
 
+/**
+ * A logical plan representing a data source relation, which will be planned to a data scan
+ * operator finally.
+ *
+ * @param output The output of this relation.
+ * @param source The instance of a data source v2 implementation.
+ * @param options The options specified for this scan, used to create the `DataSourceReader`.
+ * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`.
+ * @param filters The predicates which are pushed and handled by this data source.
+ * @param existingReader A mutable reader carrying some temporary stats during optimization and
+ *                       planning. It's always None before optimization, and does not take part in
+ *                       the equality of this plan, which means this plan is still immutable.
+ */
 case class DataSourceV2Relation(
-    fullOutput: Seq[AttributeReference],
-    reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder {
+    output: Seq[AttributeReference],
+    source: DataSourceV2,
+    options: DataSourceOptions,
+    userSpecifiedSchema: Option[StructType],
+    filters: Set[Expression],
+    existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan {
--- End diff --

Good catch! Yeah, this is a bug, but to respect the rule about solving different issues in different PRs, I'd like to fix it in a new PR.


---




[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20373
  
Ah, yup. I am okay with including it. Wanted to double check. Seems a small 
clean bug fix.


---




[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164966938
  
--- Diff: python/pyspark/sql/window.py ---
@@ -124,16 +126,20 @@ def rangeBetween(start, end):
 values directly.
 
 :param start: boundary start, inclusive.
-  The frame is unbounded if this is ``Window.unboundedPreceding``, or
+  The frame is unbounded if this is ``Window.unboundedPreceding``,
+  ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or
   any value less than or equal to max(-sys.maxsize, -9223372036854775808).
 :param end: boundary end, inclusive.
-The frame is unbounded if this is ``Window.unboundedFollowing``, or
+The frame is unbounded if this is ``Window.unboundedFollowing``,
+``org.apache.spark.sql.catalyst.expressions.UnboundedFollowing``, or
 any value greater than or equal to min(sys.maxsize, 9223372036854775807).
--- End diff --

@jiangxb1987 Sorry for the extra line. Will remove. 


---




[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20442
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20442
  
**[Test build #86864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86864/testReport)** for PR 20442 at commit [`dfaad52`](https://github.com/apache/spark/commit/dfaad5271029f2d1ec9e47fb2ddab1738543c238).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20442
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86864/
Test PASSed.


---




[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-01-30 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20373
  
@HyukjinKwon Yes, I meant to port it too. I agree with matching it to v0.4.2 as closely as we can, so I don't think we should include the formatting one, only the bug fixes.


---




[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20448
  
**[Test build #86865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86865/testReport)** for PR 20448 at commit [`11220db`](https://github.com/apache/spark/commit/11220db7879798967cda85d2aa4e68fefb8ec646).


---




[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/419/
Test PASSed.


---




[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164965129
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -809,6 +809,45 @@ def ntile(n):
 return Column(sc._jvm.functions.ntile(int(n)))
 
 
+@since(2.3)
+def unboundedPreceding():
+"""
+Window function: returns the special frame boundary that represents 
the first row
+in the window partition.
+>>> df = spark.createDataFrame([(5,)])
+>>> df.select(unboundedPreceding()).show
+
+"""
+sc = SparkContext._active_spark_context
+return Column(sc._jvm.functions.unboundedPreceding())
+
+
+@since(2.3)
+def unboundedFollowing():
+"""
+Window function: returns the special frame boundary that represents 
the last row
+in the window partition.
+>>> df = spark.createDataFrame([(5,)])
--- End diff --

Will add a newline. Thanks. 


---




[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164965040
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -809,6 +809,45 @@ def ntile(n):
 return Column(sc._jvm.functions.ntile(int(n)))
 
 
+@since(2.3)
+def unboundedPreceding():
+"""
+Window function: returns the special frame boundary that represents 
the first row
+in the window partition.
+>>> df = spark.createDataFrame([(5,)])
+>>> df.select(unboundedPreceding()).show
+
--- End diff --

@HyukjinKwon Thanks for your comment.
Yes, it is intentional. I am trying to print out something that contains 
"UNBOUNDED PRECEDING" when calling the method unboundedPreceding(), so I will 
know this method gets executed correctly. I couldn't figure out a better way to 
do this. Please let me know if you have a better way.  


---




[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20373
  
Hm .. I was thinking we should focus on matching it to v0.4.2 as closely as we can. A few of the bug fixes @ueshin and I found looked like regressions compared to our copy, so I suggested including them here ..


---




[GitHub] spark issue #20445: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20445
  
Build finished. Test FAILed.


---




[GitHub] spark issue #20445: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20445
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86855/
Test FAILed.


---




[GitHub] spark issue #20445: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20445
  
**[Test build #86855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86855/testReport)** for PR 20445 at commit [`e66d809`](https://github.com/apache/spark/commit/e66d809fe501b19b923a88d1b4cb9df69b4ae329).
 * This patch **fails from timeout after a configured wait of \`300m\`**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---




[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20373
  
@ueshin, is https://github.com/cloudpipe/cloudpickle/pull/140 a regression compared to our cloudpickle copy, or do you suggest porting it too as a simple and clean bug fix? It seems the same code was already there.


---




[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20449
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20449
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #20449: [SPARK-23040][CORE]: Returns interruptible iterat...

2018-01-30 Thread advancedxy
GitHub user advancedxy opened a pull request:

https://github.com/apache/spark/pull/20449

[SPARK-23040][CORE]: Returns interruptible iterator for shuffle reader

## What changes were proposed in this pull request?

Before this commit, a non-interruptible iterator is returned if an aggregator or ordering is specified.
This commit also ensures that the sorter is closed even when the task is cancelled (killed) in the middle of sorting.
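
A minimal sketch of the idea, using the existing `org.apache.spark.InterruptibleIterator` wrapper (the helper name `wrapForKill` is invented for this example):

```scala
import org.apache.spark.{InterruptibleIterator, TaskContext}

object InterruptibleReadSketch {
  // Wrap the iterator produced after aggregation/sorting so the task's kill
  // flag is re-checked as records are consumed, instead of handing back the
  // raw, non-interruptible iterator.
  def wrapForKill[T](context: TaskContext, records: Iterator[T]): Iterator[T] =
    new InterruptibleIterator[T](context, records)
}
```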

## How was this patch tested?

Add a unit test in JobCancellationSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/advancedxy/spark SPARK-23040

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20449.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20449


commit acca0e3da2da21e4e184f01b4e8f7b6b8c05ee1d
Author: Xianjin YE 
Date:   2018-01-31T06:27:21Z

[SPARK-23040][CORE]: Returns interruptible iterator for shuffle reader

Before this commit, a non-interruptible iterator is returned if
aggregator or ordering is specified.




---




[GitHub] spark issue #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkSession ...

2018-01-30 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20404
  
@felixcheung what is your opinion on this? Do we really need to handle this case?


---




[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...

2018-01-30 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/20422
  
@squito I added a test for the index file. Please check it again, thanks.


---




[GitHub] spark issue #17332: [SPARK-10764][ML] Add optional caching to Pipelines

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17332
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-01-30 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20373
  
>Wait .. @BryanCutler, did you port the formatting one here ..? I was 
thinking we should match it to v0.4.2 as same as possible to reduce the diff.

Yes, I added that here as well because I thought that's what Holden asked for. Could you confirm, @holdenk?


---




[GitHub] spark issue #20415: [SPARK-23247][SQL]combines Unsafe operations and statist...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20415
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20415: [SPARK-23247][SQL]combines Unsafe operations and statist...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20415
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86860/
Test PASSed.


---




[GitHub] spark issue #20415: [SPARK-23247][SQL]combines Unsafe operations and statist...

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20415
  
**[Test build #86860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86860/testReport)** for PR 20415 at commit [`e3e09d9`](https://github.com/apache/spark/commit/e3e09d98072bd39328a4e7d4de1ddd38594c6232).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20447: [SPARK-23279][SS] Avoid triggering distributed jo...

2018-01-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20447


---




[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20422
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86858/
Test PASSed.


---




[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20422
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20447: [SPARK-23279][SS] Avoid triggering distributed job for C...

2018-01-30 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/20447
  
Merging to master and 2.3. Thanks for the review!


---




[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20422
  
**[Test build #86858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86858/testReport)** for PR 20422 at commit [`87e6dc0`](https://github.com/apache/spark/commit/87e6dc0b9ce362e754142c63b95a1841f427471a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20442
  
**[Test build #86864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86864/testReport)** for PR 20442 at commit [`dfaad52`](https://github.com/apache/spark/commit/dfaad5271029f2d1ec9e47fb2ddab1738543c238).


---




[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20442
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20442
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/418/
Test PASSed.


---




[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164958102
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,16 +17,57 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.StructType
 
+/**
+ * A logical plan representing a data source relation, which will be planned to a data scan
+ * operator finally.
+ *
+ * @param output The output of this relation.
+ * @param source The instance of a data source v2 implementation.
+ * @param options The options specified for this scan, used to create the `DataSourceReader`.
+ * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`.
+ * @param filters The predicates which are pushed and handled by this data source.
+ * @param existingReader A mutable reader carrying some temporary stats during optimization and
+ *                       planning. It's always None before optimization, and does not take part in
+ *                       the equality of this plan, which means this plan is still immutable.
+ */
 case class DataSourceV2Relation(
-    fullOutput: Seq[AttributeReference],
-    reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder {
+    output: Seq[AttributeReference],
+    source: DataSourceV2,
+    options: DataSourceOptions,
+    userSpecifiedSchema: Option[StructType],
+    filters: Set[Expression],
+    existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan {
+
+  override def references: AttributeSet = AttributeSet.empty
+
+  override def sourceClass: Class[_ <: DataSourceV2] = source.getClass
 
   override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation]
 
+  def reader: DataSourceReader = existingReader.getOrElse {
+    (source, userSpecifiedSchema) match {
+      case (ds: ReadSupportWithSchema, Some(schema)) =>
+        ds.createReader(schema, options)
+
+      case (ds: ReadSupport, None) =>
+        ds.createReader(options)
+
+      case (ds: ReadSupport, Some(schema)) =>
+        val reader = ds.createReader(options)
+        // Sanity check, this should be guaranteed by `DataFrameReader.load`
+        assert(reader.readSchema() == schema)
+        reader
+
+      case _ => throw new IllegalStateException()
+    }
+  }
+
--- End diff --

data source v2 doesn't support tables yet, so we don't have this problem 
now.


---




[GitHub] spark issue #20447: [SPARK-23279][SS] Avoid triggering distributed job for C...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20447
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20447: [SPARK-23279][SS] Avoid triggering distributed job for C...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20447
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86859/
Test PASSed.


---




[GitHub] spark issue #20447: [SPARK-23279][SS] Avoid triggering distributed job for C...

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20447
  
**[Test build #86859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86859/testReport)** for PR 20447 at commit [`4b2baeb`](https://github.com/apache/spark/commit/4b2baeb8e00d575189036609c232a9c24f69e4a0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-01-30 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20373
  
What about https://github.com/cloudpipe/cloudpickle/pull/140?


---




[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20419
  
Then, we should add a test case to ensure it will not be broken. 


---




[GitHub] spark pull request #20442: [SPARK-23265][SQL]Update multi-column error handl...

2018-01-30 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20442#discussion_r164956149
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -167,25 +167,31 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("2.3.0")
   def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
 
-  private[feature] def getInOutCols: (Array[String], Array[String]) = {
-    require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) ||
-      (!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)),
-      "QuantileDiscretizer only supports setting either inputCol/outputCol or" +
-        "inputCols/outputCols."
-    )
+  @Since("1.6.0")
+  override def transformSchema(schema: StructType): StructType = {
+    ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol),
+      Seq(outputCols))
 
-    if (isSet(inputCol)) {
-      (Array($(inputCol)), Array($(outputCol)))
-    } else {
-      require($(inputCols).length == $(outputCols).length,
-        "inputCols number do not match outputCols")
-      ($(inputCols), $(outputCols))
+    if (isSet(inputCols)) {
+      require(getInputCols.length == getOutputCols.length,
+        s"QuantileDiscretizer $this has mismatched Params " +
--- End diff --

The only reason I have $this is that Bucketizer has $this, and I am trying to be consistent with the Bucketizer implementation.
```
if (isSet(inputCols)) {
  require(getInputCols.length == getOutputCols.length &&
    getInputCols.length == getSplitsArray.length, s"Bucketizer $this has mismatched Params " +
    s"for multi-column transform.  Params (inputCols, outputCols, splitsArray) should have " +
```


---




[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164956171
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,16 +17,57 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.StructType
 
+/**
+ * A logical plan representing a data source relation, which will be planned to a data scan
+ * operator finally.
+ *
+ * @param output The output of this relation.
+ * @param source The instance of a data source v2 implementation.
+ * @param options The options specified for this scan, used to create the `DataSourceReader`.
+ * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`.
+ * @param filters The predicates which are pushed and handled by this data source.
+ * @param existingReader A mutable reader carrying some temporary stats during optimization and
+ *                       planning. It's always None before optimization, and does not take part in
+ *                       the equality of this plan, which means this plan is still immutable.
+ */
 case class DataSourceV2Relation(
-    fullOutput: Seq[AttributeReference],
-    reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder {
+    output: Seq[AttributeReference],
+    source: DataSourceV2,
+    options: DataSourceOptions,
+    userSpecifiedSchema: Option[StructType],
+    filters: Set[Expression],
+    existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan {
+
+  override def references: AttributeSet = AttributeSet.empty
+
+  override def sourceClass: Class[_ <: DataSourceV2] = source.getClass
 
   override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation]
 
+  def reader: DataSourceReader = existingReader.getOrElse {
+    (source, userSpecifiedSchema) match {
+      case (ds: ReadSupportWithSchema, Some(schema)) =>
+        ds.createReader(schema, options)
+
+      case (ds: ReadSupport, None) =>
+        ds.createReader(options)
+
+      case (ds: ReadSupport, Some(schema)) =>
+        val reader = ds.createReader(options)
+        // Sanity check, this should be guaranteed by `DataFrameReader.load`
+        assert(reader.readSchema() == schema)
+        reader
+
+      case _ => throw new IllegalStateException()
+    }
+  }
+
--- End diff --

What is the behavior we expect when users call `REFRESH TABLE`?

Another potential issue is storing the statistics in the external catalog. Do we still have the issues previously discussed in https://github.com/apache/spark/pull/14712?


---




[GitHub] spark pull request #20442: [SPARK-23265][SQL]Update multi-column error handl...

2018-01-30 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20442#discussion_r164955798
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -167,25 +167,31 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("2.3.0")
   def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
 
-  private[feature] def getInOutCols: (Array[String], Array[String]) = {
-    require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) ||
-      (!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)),
-      "QuantileDiscretizer only supports setting either inputCol/outputCol or" +
-        "inputCols/outputCols."
-    )
+  @Since("1.6.0")
+  override def transformSchema(schema: StructType): StructType = {
+    ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol),
--- End diff --

Thanks for your comment. I will add the check.


---




[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164955518
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,16 +17,57 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.StructType
 
+/**
+ * A logical plan representing a data source relation, which will be planned to a data scan
+ * operator finally.
+ *
+ * @param output The output of this relation.
+ * @param source The instance of a data source v2 implementation.
+ * @param options The options specified for this scan, used to create the `DataSourceReader`.
+ * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`.
+ * @param filters The predicates which are pushed and handled by this data source.
+ * @param existingReader A mutable reader carrying some temporary stats during optimization and
+ *                       planning. It's always None before optimization, and does not take part in
+ *                       the equality of this plan, which means this plan is still immutable.
+ */
 case class DataSourceV2Relation(
-    fullOutput: Seq[AttributeReference],
-    reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder {
+    output: Seq[AttributeReference],
+    source: DataSourceV2,
+    options: DataSourceOptions,
+    userSpecifiedSchema: Option[StructType],
+    filters: Set[Expression],
+    existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan {
+
+  override def references: AttributeSet = AttributeSet.empty
+
+  override def sourceClass: Class[_ <: DataSourceV2] = source.getClass
 
   override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation]
 
+  def reader: DataSourceReader = existingReader.getOrElse {
+    (source, userSpecifiedSchema) match {
+      case (ds: ReadSupportWithSchema, Some(schema)) =>
+        ds.createReader(schema, options)
+
+      case (ds: ReadSupport, None) =>
+        ds.createReader(options)
+
+      case (ds: ReadSupport, Some(schema)) =>
+        val reader = ds.createReader(options)
+        // Sanity check, this should be guaranteed by `DataFrameReader.load`
+        assert(reader.readSchema() == schema)
+        reader
+
+      case _ => throw new IllegalStateException()
+    }
+  }
+
--- End diff --

What is the output of this node in `Explain`?


---




[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164955297
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,16 +17,57 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.StructType
 
+/**
+ * A logical plan representing a data source relation, which will be planned to a data scan
+ * operator finally.
+ *
+ * @param output The output of this relation.
+ * @param source The instance of a data source v2 implementation.
+ * @param options The options specified for this scan, used to create the `DataSourceReader`.
+ * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`.
+ * @param filters The predicates which are pushed and handled by this data source.
+ * @param existingReader A mutable reader carrying some temporary stats during optimization and
+ *                       planning. It's always None before optimization, and does not take part in
+ *                       the equality of this plan, which means this plan is still immutable.
+ */
 case class DataSourceV2Relation(
-    fullOutput: Seq[AttributeReference],
-    reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder {
+    output: Seq[AttributeReference],
+    source: DataSourceV2,
+    options: DataSourceOptions,
+    userSpecifiedSchema: Option[StructType],
+    filters: Set[Expression],
+    existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan {
+
+  override def references: AttributeSet = AttributeSet.empty
+
+  override def sourceClass: Class[_ <: DataSourceV2] = source.getClass
 
   override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation]
 
+  def reader: DataSourceReader = existingReader.getOrElse {
+    (source, userSpecifiedSchema) match {
+      case (ds: ReadSupportWithSchema, Some(schema)) =>
+        ds.createReader(schema, options)
+
+      case (ds: ReadSupport, None) =>
+        ds.createReader(options)
+
+      case (ds: ReadSupport, Some(schema)) =>
+        val reader = ds.createReader(options)
+        // Sanity check, this should be guaranteed by `DataFrameReader.load`
+        assert(reader.readSchema() == schema)
+        reader
+
+      case _ => throw new IllegalStateException()
+    }
+  }
+
--- End diff --

Do we need to override a `def doCanonicalize`?


---




[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164954926
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,16 +17,57 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.StructType
 
+/**
+ * A logical plan representing a data source relation, which will be planned to a data scan
+ * operator finally.
+ *
+ * @param output The output of this relation.
+ * @param source The instance of a data source v2 implementation.
+ * @param options The options specified for this scan, used to create the `DataSourceReader`.
+ * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`.
+ * @param filters The predicates which are pushed and handled by this data source.
+ * @param existingReader A mutable reader carrying some temporary stats during optimization and
+ *                       planning. It's always None before optimization, and does not take part in
+ *                       the equality of this plan, which means this plan is still immutable.
+ */
 case class DataSourceV2Relation(
-    fullOutput: Seq[AttributeReference],
-    reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder {
+    output: Seq[AttributeReference],
+    source: DataSourceV2,
+    options: DataSourceOptions,
+    userSpecifiedSchema: Option[StructType],
+    filters: Set[Expression],
+    existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan {
--- End diff --

Could you add a test for self join? Just to ensure it still works.


---




[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20279
  
Build finished. Test FAILed.


---




[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20279
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86853/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86863/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20279
  
**[Test build #86853 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86853/testReport)**
 for PR 20279 at commit 
[`9f6c6b9`](https://github.com/apache/spark/commit/9f6c6b9ec5f9669a9147ce61ada90977e255ec85).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20448
  
**[Test build #86863 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86863/testReport)**
 for PR 20448 at commit 
[`d96a48f`](https://github.com/apache/spark/commit/d96a48ff8f03daff23633b700a1374c89f3675e2).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164954221
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ---
@@ -17,16 +17,57 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, 
ReadSupport, ReadSupportWithSchema}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.StructType
 
+/**
+ * A logical plan representing a data source relation, which will be 
planned to a data scan
+ * operator finally.
+ *
+ * @param output The output of this relation.
+ * @param source The instance of a data source v2 implementation.
+ * @param options The options specified for this scan, used to create the 
`DataSourceReader`.
+ * @param userSpecifiedSchema The user specified schema, used to create 
the `DataSourceReader`.
+ * @param filters The predicates which are pushed and handled by this data 
source.
+ * @param existingReader An mutable reader carrying some temporary stats 
during optimization and
--- End diff --

`An` ->  `A`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164953600
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2QueryPlan.scala
 ---
@@ -19,50 +19,31 @@ package org.apache.spark.sql.execution.datasources.v2
 
 import java.util.Objects
 
-import org.apache.spark.sql.catalyst.expressions.{Attribute, 
AttributeReference}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, 
Expression}
+import org.apache.spark.sql.sources.v2.DataSourceV2
 
 /**
- * A base class for data source reader holder with customized 
equals/hashCode methods.
+ * A base class for data source v2 related query plan. It defines the 
equals/hashCode methods
+ * according to some common information.
--- End diff --

We might need to emphasize this is for both physical and logical plans.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164954392
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ---
@@ -17,16 +17,57 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, 
ReadSupport, ReadSupportWithSchema}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.StructType
 
+/**
+ * A logical plan representing a data source relation, which will be 
planned to a data scan
+ * operator finally.
+ *
+ * @param output The output of this relation.
+ * @param source The instance of a data source v2 implementation.
+ * @param options The options specified for this scan, used to create the 
`DataSourceReader`.
+ * @param userSpecifiedSchema The user specified schema, used to create 
the `DataSourceReader`.
+ * @param filters The predicates which are pushed and handled by this data 
source.
+ * @param existingReader An mutable reader carrying some temporary stats 
during optimization and
+ *   planning. It's always None before optimization, 
and does not take part in
+ *   the equality of this plan, which means this plan 
is still immutable.
+ */
 case class DataSourceV2Relation(
-fullOutput: Seq[AttributeReference],
-reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder {
+output: Seq[AttributeReference],
+source: DataSourceV2,
+options: DataSourceOptions,
+userSpecifiedSchema: Option[StructType],
+filters: Set[Expression],
+existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan {
--- End diff --

Why doesn't this plan extend `MultiInstanceRelation`?
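
For context, a hedged sketch of what that could look like (an assumption, not 
part of the PR): `newInstance` re-creates the output attributes with fresh 
exprIds, which is what lets self-joins resolve.

```scala
import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

// Hedged sketch, an assumption rather than part of the PR: extending
// MultiInstanceRelation lets the analyzer give each occurrence of the
// relation fresh expression ids, so a self-join no longer produces
// duplicate attributes on both sides.
case class ExampleV2Relation(output: Seq[AttributeReference])
  extends LeafNode with MultiInstanceRelation {

  override def newInstance(): LogicalPlan =
    copy(output = output.map(_.newInstance()))
}
```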


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20448
  
**[Test build #86863 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86863/testReport)**
 for PR 20448 at commit 
[`d96a48f`](https://github.com/apache/spark/commit/d96a48ff8f03daff23633b700a1374c89f3675e2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/417/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20279
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86854/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20279
  
Build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20279
  
**[Test build #86854 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86854/testReport)**
 for PR 20279 at commit 
[`a81c2ec`](https://github.com/apache/spark/commit/a81c2ecdafd54a2c5bfb07c6f1f53546eaa96c7c).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20435: [SPARK-23268][SQL]Reorganize packages in data sou...

2018-01-30 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20435#discussion_r164953225
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala
 ---
@@ -20,14 +20,16 @@ package org.apache.spark.sql.kafka010
 import org.apache.kafka.common.TopicPartition
 
 import org.apache.spark.sql.execution.streaming.{Offset, SerializedOffset}
-import org.apache.spark.sql.sources.v2.streaming.reader.{Offset => OffsetV2, PartitionOffset}
+import org.apache.spark.sql.sources.v2.reader.streaming
+import org.apache.spark.sql.sources.v2.reader.streaming.PartitionOffset
 
 /**
  * An [[Offset]] for the [[KafkaSource]]. This one tracks all partitions 
of subscribed topics and
--- End diff --

Should this `Offset` be `streaming.Offset`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20435: [SPARK-23268][SQL]Reorganize packages in data sou...

2018-01-30 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20435#discussion_r164953380
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala
 ---
@@ -30,9 +30,8 @@ import 
org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
 import 
org.apache.spark.sql.catalyst.streaming.InternalOutputModes.{Append, Complete, 
Update}
 import org.apache.spark.sql.execution.streaming.Sink
 import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2}
-import org.apache.spark.sql.sources.v2.streaming.StreamWriteSupport
-import org.apache.spark.sql.sources.v2.streaming.writer.StreamWriter
-import org.apache.spark.sql.sources.v2.writer._
+import org.apache.spark.sql.sources.v2.writer.{StreamWriteSupport, _}
--- End diff --

`import org.apache.spark.sql.sources.v2.writer._`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86862/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20448
  
**[Test build #86862 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86862/testReport)**
 for PR 20448 at commit 
[`0665282`](https://github.com/apache/spark/commit/0665282cca2b690b2bc54e179496e03f7d77596c).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86861/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20448
  
**[Test build #86861 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86861/testReport)**
 for PR 20448 at commit 
[`7441334`](https://github.com/apache/spark/commit/7441334e210944e1419be4059d12c06385c586cb).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20448#discussion_r164952844
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2QueryPlan.scala
 ---
@@ -19,50 +19,31 @@ package org.apache.spark.sql.execution.datasources.v2
 
 import java.util.Objects
 
-import org.apache.spark.sql.catalyst.expressions.{Attribute, 
AttributeReference}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, 
Expression}
+import org.apache.spark.sql.sources.v2.DataSourceV2
 
 /**
- * A base class for data source reader holder with customized 
equals/hashCode methods.
+ * A base class for data source v2 related query plan. It defines the 
equals/hashCode methods
+ * according to some common information.
  */
-trait DataSourceReaderHolder {
+trait DataSourceV2QueryPlan {
 
-  /**
-   * The full output of the data source reader, without column pruning.
-   */
-  def fullOutput: Seq[AttributeReference]
+  def output: Seq[Attribute]
+  def sourceClass: Class[_ <: DataSourceV2]
+  def filters: Set[Expression]
 
-  /**
-   * The held data source reader.
-   */
-  def reader: DataSourceReader
-
-  /**
-   * The metadata of this data source reader that can be used for equality 
test.
-   */
-  private def metadata: Seq[Any] = {
-val filters: Any = reader match {
-  case s: SupportsPushDownCatalystFilters => s.pushedCatalystFilters().toSet
-  case s: SupportsPushDownFilters => s.pushedFilters().toSet
-  case _ => Nil
-}
-Seq(fullOutput, reader.getClass, reader.readSchema(), filters)
-  }
+  // The metadata of this data source relation that can be used for 
equality test.
+  private def metadata: Seq[Any] = Seq(output, sourceClass, filters)
 
   def canEqual(other: Any): Boolean
 
   override def equals(other: Any): Boolean = other match {
-case other: DataSourceReaderHolder =>
-  canEqual(other) && metadata.length == other.metadata.length &&
-metadata.zip(other.metadata).forall { case (l, r) => l == r }
+case other: DataSourceV2QueryPlan =>
+  canEqual(other) && metadata == other.metadata
 case _ => false
   }
 
   override def hashCode(): Int = {
 metadata.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
   }
-
-  lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name =>
--- End diff --

We don't need to do this anymore. Now that the plan is immutable, we have to 
create a new plan when applying push-down optimizations, and we can also update 
`output` at that time.
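
As a rough illustration of that pattern, here is a self-contained toy; the 
names (`ToyRelation`, `withPushedFilters`) are made up, not Spark's API:

```scala
// Self-contained toy showing the copy-on-pushdown pattern: the optimizer
// returns a new plan carrying the pushed filters and the pruned output,
// leaving the original plan untouched.
case class ToyFilter(expr: String)

case class ToyRelation(output: Seq[String], pushed: Set[ToyFilter]) {
  // Push-down yields a new plan instead of mutating a reader held inside.
  def withPushedFilters(fs: Set[ToyFilter]): ToyRelation =
    copy(pushed = pushed ++ fs)

  // Column pruning also yields a new plan, updating `output` at that time.
  def withPrunedOutput(cols: Seq[String]): ToyRelation =
    copy(output = cols)
}

object ImmutablePlanDemo extends App {
  val before = ToyRelation(Seq("a", "b", "c"), Set.empty)
  val after =
    before.withPushedFilters(Set(ToyFilter("a > 1"))).withPrunedOutput(Seq("a"))
  println(before) // ToyRelation(List(a, b, c),Set()) -- unchanged
  println(after)  // ToyRelation(List(a),Set(ToyFilter(a > 1)))
}
```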


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20448
  
**[Test build #86862 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86862/testReport)**
 for PR 20448 at commit 
[`0665282`](https://github.com/apache/spark/commit/0665282cca2b690b2bc54e179496e03f7d77596c).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20448
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/416/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20448
  
**[Test build #86861 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86861/testReport)**
 for PR 20448 at commit 
[`7441334`](https://github.com/apache/spark/commit/7441334e210944e1419be4059d12c06385c586cb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20448
  
cc @rdblue @tdas @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...

2018-01-30 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20448

[SPARK-23203][SQL] make DataSourceV2Relation immutable

## What changes were proposed in this pull request?

This is inspired by https://github.com/apache/spark/pull/20387, but focuses 
only on making the plan immutable.

The idea is simple: instead of keeping the mutable `DataSourceReader` in the 
plan, we keep `DataSourceV2` and create the reader when needed. The pushdown 
information is stored in the plan itself, instead of relying on the mutable 
reader.

This also lets us remove two unnecessary APIs from 
`SupportsPushDownCatalystFilters` and `SupportsPushDownFilters`.
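
A minimal self-contained sketch of that idea, using made-up names 
(`ToySource`, `ToyV2Relation`) rather than the actual v2 API:

```scala
// Illustrative sketch only: the plan stores the stateless source plus the
// pushdown info needed to re-create a reader; no mutable reader is stored.
case class ToyReader(columns: Seq[String])

trait ToySource {
  def createReader(requiredColumns: Seq[String]): ToyReader
}

case class ToyV2Relation(source: ToySource, requiredColumns: Seq[String]) {
  // Derived on demand; equal plans always derive equivalent readers,
  // so the plan itself stays immutable.
  def reader: ToyReader = source.createReader(requiredColumns)
}

object LazyReaderDemo extends App {
  val src = new ToySource {
    def createReader(requiredColumns: Seq[String]): ToyReader =
      ToyReader(requiredColumns)
  }
  println(ToyV2Relation(src, Seq("a", "b")).reader) // ToyReader(List(a, b))
}
```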

## How was this patch tested?

I improved the test in `DataSourceV2Suite` to make sure this change doesn't 
break column pruning and filter push-down.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark immutable-plan

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20448.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20448


commit 7441334e210944e1419be4059d12c06385c586cb
Author: Wenchen Fan 
Date:   2018-01-31T03:12:00Z

make DataSourceV2Relation immutable




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-30 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20419
  
@gatorsmile Not directly. The `CodeAndComment` case class is just a 
"container"; it doesn't control what gets into the `body` field. When we 
force-embed a comment, it leaves a placeholder comment in the generated code, 
which is in the `body` (the actual comment contents are in the `comment` map). 
I was just curious whether any reviewers know off the top of their heads 
whether this placeholder may affect equality.
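
To make the concern concrete, a small self-contained sketch that mirrors the 
equality behavior as I understand it (illustrative names, not the real 
`CodeAndComment`):

```scala
// Sketch of the mechanism described above: the placeholder lives in `body`,
// the comment text lives in a side map, and equality looks only at `body`.
class CodeAndCommentDemo(val body: String, val comments: Map[String, String]) {
  override def equals(that: Any): Boolean = that match {
    case t: CodeAndCommentDemo => t.body == body
    case _ => false
  }
  override def hashCode(): Int = body.hashCode
}

object PlaceholderDemo extends App {
  val withPlaceholder = new CodeAndCommentDemo(
    "/*c1*/ int v = in.getInt(0);", Map("c1" -> "// codegenStageId=1"))
  val plain = new CodeAndCommentDemo("int v = in.getInt(0);", Map.empty)
  // false: the forced placeholder is part of `body`, so otherwise-identical
  // generated code compares unequal.
  println(withPlaceholder == plain)
}
```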


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20419
  
@rednaxelafx Does the following 
[code](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1286-L1294)
 address your concern?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId i...

2018-01-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20419#discussion_r164951129
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1227,12 +1227,13 @@ class CodegenContext {
   /**
* Register a comment and return the corresponding place holder
*/
-  def registerComment(text: => String): String = {
+  def registerComment(text: => String, forceComment: Boolean = false): String = {
--- End diff --

Nit: rename it to `force`?

Also add the text?
```
@param force whether to force registering the comments
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20400
  
LGTM too except the three comments above.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164950417
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -809,6 +809,45 @@ def ntile(n):
 return Column(sc._jvm.functions.ntile(int(n)))
 
 
+@since(2.3)
+def unboundedPreceding():
+"""
+Window function: returns the special frame boundary that represents 
the first row
+in the window partition.
+>>> df = spark.createDataFrame([(5,)])
+>>> df.select(unboundedPreceding()).show
+
--- End diff --

Seems this prints out the bound method `show` itself instead of calling 
`show()`. Is this intentional?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...

2018-01-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20400#discussion_r164950531
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -809,6 +809,45 @@ def ntile(n):
 return Column(sc._jvm.functions.ntile(int(n)))
 
 
+@since(2.3)
+def unboundedPreceding():
+"""
+Window function: returns the special frame boundary that represents 
the first row
+in the window partition.
+>>> df = spark.createDataFrame([(5,)])
+>>> df.select(unboundedPreceding()).show
+
+"""
+sc = SparkContext._active_spark_context
+return Column(sc._jvm.functions.unboundedPreceding())
+
+
+@since(2.3)
+def unboundedFollowing():
+"""
+Window function: returns the special frame boundary that represents 
the last row
+in the window partition.
+>>> df = spark.createDataFrame([(5,)])
--- End diff --

We haven't claimed to follow PEP 257 yet, but I believe it would be good to 
have a newline between the doctest and the description at least, if you don't 
mind.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


