[GitHub] spark pull request #20437: [SPARK-23270][Streaming][WEB-UI]FileInputDStream ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20437#discussion_r164973745 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( val metadata = Map( "files" -> newFiles.toList, StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n")) -val inputInfo = StreamInputInfo(id, 0, metadata) +val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata) --- End diff -- This is not a small overhead. The changes will read/scan all the new files, this is a big overhead for streaming application (data is unnecessarily read twice). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164973111 --- Diff: python/pyspark/sql/window.py --- @@ -212,16 +218,20 @@ def rangeBetween(self, start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedFollowing``, or --- End diff -- Wait .. why do we expose `org.apache.spark.sql.catalyst` path in Python doc .. ? In addition, this package is meant to be internal if I haven't missed something .. ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20437: [SPARK-23270][Streaming][WEB-UI]FileInputDStream ...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/20437#discussion_r164973156 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( val metadata = Map( "files" -> newFiles.toList, StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n")) -val inputInfo = StreamInputInfo(id, 0, metadata) +val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata) --- End diff -- Because of this little overhead, that 'Records' is not recorded? This is a obvious bug. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20450: [SPARK-23280][SQL] add map type support to ColumnVector
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20450 **[Test build #86866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86866/testReport)** for PR 20450 at commit [`8e66fb5`](https://github.com/apache/spark/commit/8e66fb5077c68ff1378ec2a9e4238f9d73af5fb4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20450: [SPARK-23280][SQL] add map type support to ColumnVector
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20450 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/421/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20450: [SPARK-23280][SQL] add map type support to ColumnVector
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20450 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20450: [SPARK-23280][SQL] add map type support to ColumnVector
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20450 cc @hvanhovell @viirya @ueshin @kiszk @gatorsmile @dongjoon-hyun --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20450: [SPARK-23280][SQL] add map type support to Column...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20450 [SPARK-23280][SQL] add map type support to ColumnVector ## What changes were proposed in this pull request? Fill the last missing piece of `ColumnVector`: the map type support. The idea is similar to the array type support. A map is basically 2 arrays for keys and values. We ask the implementations to provide a key array, a value array, and an offset and length to specify the range of this map in the key/value array. In `WritableColumnVector`, we put the key array in first child vector, and value array in second child vector, and offsets and lengths in the current vector, which is very similar to how array type is implemented here. ## How was this patch tested? a new test You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark map Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20450.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20450 commit 8e66fb5077c68ff1378ec2a9e4238f9d73af5fb4 Author: Wenchen FanDate: 2018-01-30T10:52:12Z add map type support to ColumnVector --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164972094 --- Diff: python/pyspark/sql/window.py --- @@ -212,16 +218,20 @@ def rangeBetween(self, start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). + --- End diff -- Can we have a doctest resembling this - https://github.com/jiangxb1987/spark/blob/cec519b8cfbf1ed2a3107056ef5281a5be75ec54/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala#L214-L240 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164971807 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +126,20 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). +any value greater than or equal to 9223372036854775807. """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, (int, long)) and isinstance(end, (int, long)): +if start <= Window._PRECEDING_THRESHOLD: +start = Window.unboundedPreceding +if end >= Window._FOLLOWING_THRESHOLD: +end = Window.unboundedFollowing --- End diff -- Shall we add a logic like: ``` elif isinstance(start, Column) and isinstance(end, Column): start = start._jc end = end._jc ``` too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164971509 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java --- @@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) { @Override public CalendarInterval getInterval(int ordinal) { if (columns[ordinal].isNullAt(rowId)) return null; --- End diff -- Let me try to prepare a PR tonight. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17886 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17886 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/420/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20438 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164968965 --- Diff: python/pyspark/sql/functions.py --- @@ -809,6 +809,45 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(2.3) +def unboundedPreceding(): +""" +Window function: returns the special frame boundary that represents the first row +in the window partition. +>>> df = spark.createDataFrame([(5,)]) +>>> df.select(unboundedPreceding()).show + --- End diff -- I think we should have a working example here in doc tests here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20438: [SPARK-23272][SQL] add calendar interval type support to...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20438 thanks, merging to master/2.3! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20437: [SPARK-23270][Streaming][WEB-UI]FileInputDStream ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20437#discussion_r164968292 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( val metadata = Map( "files" -> newFiles.toList, StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n")) -val inputInfo = StreamInputInfo(id, 0, metadata) +val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata) --- End diff -- This will kick off a new Spark job to read files and count, which will bring in obvious overhead. Whereas `count` in `DirectKafkaInputDStream` only calculates offsets. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164966631 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,16 +17,57 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader An mutable reader carrying some temporary stats during optimization and + * planning. It's always None before optimization, and does not take part in + * the equality of this plan, which means this plan is still immutable. + */ case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder { +output: Seq[AttributeReference], +source: DataSourceV2, +options: DataSourceOptions, +userSpecifiedSchema: Option[StructType], +filters: Set[Expression], +existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan { --- End diff -- good catch! Yea this is a bug, but to respect the rule about solving different issues in different PR, I'd like to fix it in a new PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20373 Ah, yup. I am okay with including it. Wanted to double check. Seems a small clean bug fix. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164966938 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +126,20 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). --- End diff -- @jiangxb1987 Sorry for the extra line. Will remove. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20442 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20442 **[Test build #86864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86864/testReport)** for PR 20442 at commit [`dfaad52`](https://github.com/apache/spark/commit/dfaad5271029f2d1ec9e47fb2ddab1738543c238). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20442 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86864/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20373 @HyukjinKwon Yes, I meant to port it too. I agree with matching it to v0.4.2 as same as we can, so I don't think we should include formatting one but bug fixes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20448 **[Test build #86865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86865/testReport)** for PR 20448 at commit [`11220db`](https://github.com/apache/spark/commit/11220db7879798967cda85d2aa4e68fefb8ec646). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/419/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164965129 --- Diff: python/pyspark/sql/functions.py --- @@ -809,6 +809,45 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(2.3) +def unboundedPreceding(): +""" +Window function: returns the special frame boundary that represents the first row +in the window partition. +>>> df = spark.createDataFrame([(5,)]) +>>> df.select(unboundedPreceding()).show + +""" +sc = SparkContext._active_spark_context +return Column(sc._jvm.functions.unboundedPreceding()) + + +@since(2.3) +def unboundedFollowing(): +""" +Window function: returns the special frame boundary that represents the last row +in the window partition. +>>> df = spark.createDataFrame([(5,)]) --- End diff -- Will add a newline. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164965040 --- Diff: python/pyspark/sql/functions.py --- @@ -809,6 +809,45 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(2.3) +def unboundedPreceding(): +""" +Window function: returns the special frame boundary that represents the first row +in the window partition. +>>> df = spark.createDataFrame([(5,)]) +>>> df.select(unboundedPreceding()).show + --- End diff -- @HyukjinKwon Thanks for your comment. Yes, it is intentional. I am trying to print out something that contains "UNBOUNDED PRECEDING" when calling the method unboundedPreceding(), so I will know this method gets executed correctly. I couldn't figure out a better way to do this. Please let me know if you have a better way. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20373 Hm .. I was thinking we should focus on matching it to v0.4.2 as same as we can. Few bug links I and @ueshin found looked regressions comparing to our copy so I suggested to include them here .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20445: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20445 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20445: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20445 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86855/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20445: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20445 **[Test build #86855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86855/testReport)** for PR 20445 at commit [`e66d809`](https://github.com/apache/spark/commit/e66d809fe501b19b923a88d1b4cb9df69b4ae329). * This patch **fails from timeout after a configured wait of \`300m\`**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20373 @ueshin, is https://github.com/cloudpipe/cloudpickle/pull/140 a regression comparing to our cloudcpikle copy, or do you suggest to port it too as it's a simple and clean bug fix? Seems the same codes were already there. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20449 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20449 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20449: [SPARK-23040][CORE]: Returns interruptible iterat...
GitHub user advancedxy opened a pull request: https://github.com/apache/spark/pull/20449 [SPARK-23040][CORE]: Returns interruptible iterator for shuffle reader ## What changes were proposed in this pull request? Before this commit, a non-interruptible iterator is returned if aggregator or ordering is specified. This commit also ensures that sorter is closed even when task is cancelled(killed) in the middle of sorting. ## How was this patch tested? Add a unit test in JobCancellationSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/advancedxy/spark SPARK-23040 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20449.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20449 commit acca0e3da2da21e4e184f01b4e8f7b6b8c05ee1d Author: Xianjin YEDate: 2018-01-31T06:27:21Z [SPARK-23040][CORE]: Returns interruptible iterator for shuffle reader Before this commit, a non-interruptible iterator is returned if aggregator or ordering is specified. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkSession ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20404 @felixcheung what is your opinion on this, do we really need to handle this case? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/20422 @squito add a test for index file. plz check it again, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17332: [SPARK-10764][ML] Add optional caching to Pipelines
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17332 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20373 >Wait .. @BryanCutler, did you port the formatting one here ..? I was thinking we should match it to v0.4.2 as same as possible to reduce the diff. Yes I added that here as well because I thought that's what Holden asked for, could you confirm @holdenk ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20415: [SPARK-23247][SQL]combines Unsafe operations and statist...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20415 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20415: [SPARK-23247][SQL]combines Unsafe operations and statist...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20415 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86860/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20415: [SPARK-23247][SQL]combines Unsafe operations and statist...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20415 **[Test build #86860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86860/testReport)** for PR 20415 at commit [`e3e09d9`](https://github.com/apache/spark/commit/e3e09d98072bd39328a4e7d4de1ddd38594c6232). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20447: [SPARK-23279][SS] Avoid triggering distributed jo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20447 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86858/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20447: [SPARK-23279][SS] Avoid triggering distributed job for C...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20447 Merging to master and 2.3. Thanks for the review! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20422 **[Test build #86858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86858/testReport)** for PR 20422 at commit [`87e6dc0`](https://github.com/apache/spark/commit/87e6dc0b9ce362e754142c63b95a1841f427471a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20442 **[Test build #86864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86864/testReport)** for PR 20442 at commit [`dfaad52`](https://github.com/apache/spark/commit/dfaad5271029f2d1ec9e47fb2ddab1738543c238). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20442 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20442: [SPARK-23265][SQL]Update multi-column error handling log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20442 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/418/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164958102 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,16 +17,57 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader An mutable reader carrying some temporary stats during optimization and + * planning. It's always None before optimization, and does not take part in + * the equality of this plan, which means this plan is still immutable. + */ case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder { +output: Seq[AttributeReference], +source: DataSourceV2, +options: DataSourceOptions, +userSpecifiedSchema: Option[StructType], +filters: Set[Expression], +existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan { + + override def references: AttributeSet = AttributeSet.empty + + override def sourceClass: Class[_ <: DataSourceV2] = source.getClass override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation] + def reader: DataSourceReader = existingReader.getOrElse { +(source, userSpecifiedSchema) match { + case (ds: ReadSupportWithSchema, Some(schema)) => +ds.createReader(schema, options) + + case (ds: ReadSupport, None) => +ds.createReader(options) + + case (ds: ReadSupport, Some(schema)) => +val reader = ds.createReader(options) +// Sanity check, this should be guaranteed by `DataFrameReader.load` +assert(reader.readSchema() == schema) +reader + + case _ => throw new IllegalStateException() +} + } + --- End diff -- data source v2 doesn't support tables yet, so we don't have this problem now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20447: [SPARK-23279][SS] Avoid triggering distributed job for C...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20447 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20447: [SPARK-23279][SS] Avoid triggering distributed job for C...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20447 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86859/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20447: [SPARK-23279][SS] Avoid triggering distributed job for C...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20447 **[Test build #86859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86859/testReport)** for PR 20447 at commit [`4b2baeb`](https://github.com/apache/spark/commit/4b2baeb8e00d575189036609c232a9c24f69e4a0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20373 What about https://github.com/cloudpipe/cloudpickle/pull/140? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20419 Then, we should add a test case to ensure it will not be broken. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20442: [SPARK-23265][SQL]Update multi-column error handl...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20442#discussion_r164956149 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -167,25 +167,31 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui @Since("2.3.0") def setOutputCols(value: Array[String]): this.type = set(outputCols, value) - private[feature] def getInOutCols: (Array[String], Array[String]) = { -require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) || - (!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)), - "QuantileDiscretizer only supports setting either inputCol/outputCol or" + -"inputCols/outputCols." -) + @Since("1.6.0") + override def transformSchema(schema: StructType): StructType = { +ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol), + Seq(outputCols)) -if (isSet(inputCol)) { - (Array($(inputCol)), Array($(outputCol))) -} else { - require($(inputCols).length == $(outputCols).length, -"inputCols number do not match outputCols") - ($(inputCols), $(outputCols)) +if (isSet(inputCols)) { + require(getInputCols.length == getOutputCols.length, +s"QuantileDiscretizer $this has mismatched Params " + --- End diff -- The only reason I have $this is because Bucketizer has $this and I am trying to be consistent with Bucketizer implementation. ``` if (isSet(inputCols)) { require(getInputCols.length == getOutputCols.length && getInputCols.length == getSplitsArray.length, s"Bucketizer $this has mismatched Params " + s"for multi-column transform. Params (inputCols, outputCols, splitsArray) should have " + ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164956171 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,16 +17,57 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader An mutable reader carrying some temporary stats during optimization and + * planning. It's always None before optimization, and does not take part in + * the equality of this plan, which means this plan is still immutable. + */ case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder { +output: Seq[AttributeReference], +source: DataSourceV2, +options: DataSourceOptions, +userSpecifiedSchema: Option[StructType], +filters: Set[Expression], +existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan { + + override def references: AttributeSet = AttributeSet.empty + + override def sourceClass: Class[_ <: DataSourceV2] = source.getClass override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation] + def reader: DataSourceReader = existingReader.getOrElse { +(source, userSpecifiedSchema) match { + case (ds: ReadSupportWithSchema, Some(schema)) => +ds.createReader(schema, options) + + case (ds: ReadSupport, None) => +ds.createReader(options) + + case (ds: ReadSupport, Some(schema)) => +val reader = ds.createReader(options) +// Sanity check, this should be guaranteed by `DataFrameReader.load` +assert(reader.readSchema() == schema) +reader + + case _ => throw new IllegalStateException() +} + } + --- End diff -- What is the behavior we expect when users call `REFRESH TABLE`? Also another potential issue is about storing the statistics in the external catalog? Do we still have the previous issues discussed in https://github.com/apache/spark/pull/14712? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20442: [SPARK-23265][SQL]Update multi-column error handl...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20442#discussion_r164955798 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -167,25 +167,31 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui @Since("2.3.0") def setOutputCols(value: Array[String]): this.type = set(outputCols, value) - private[feature] def getInOutCols: (Array[String], Array[String]) = { -require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) || - (!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)), - "QuantileDiscretizer only supports setting either inputCol/outputCol or" + -"inputCols/outputCols." -) + @Since("1.6.0") + override def transformSchema(schema: StructType): StructType = { +ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol), --- End diff -- Thanks for your comment. I will add the check. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164955518 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,16 +17,57 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader An mutable reader carrying some temporary stats during optimization and + * planning. It's always None before optimization, and does not take part in + * the equality of this plan, which means this plan is still immutable. + */ case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder { +output: Seq[AttributeReference], +source: DataSourceV2, +options: DataSourceOptions, +userSpecifiedSchema: Option[StructType], +filters: Set[Expression], +existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan { + + override def references: AttributeSet = AttributeSet.empty + + override def sourceClass: Class[_ <: DataSourceV2] = source.getClass override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation] + def reader: DataSourceReader = existingReader.getOrElse { +(source, userSpecifiedSchema) match { + case (ds: ReadSupportWithSchema, Some(schema)) => +ds.createReader(schema, options) + + case (ds: ReadSupport, None) => +ds.createReader(options) + + case (ds: ReadSupport, Some(schema)) => +val reader = ds.createReader(options) +// Sanity check, this should be guaranteed by `DataFrameReader.load` +assert(reader.readSchema() == schema) +reader + + case _ => throw new IllegalStateException() +} + } + --- End diff -- What is the output of this node in `Explain`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164955297 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,16 +17,57 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader An mutable reader carrying some temporary stats during optimization and + * planning. It's always None before optimization, and does not take part in + * the equality of this plan, which means this plan is still immutable. + */ case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder { +output: Seq[AttributeReference], +source: DataSourceV2, +options: DataSourceOptions, +userSpecifiedSchema: Option[StructType], +filters: Set[Expression], +existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan { + + override def references: AttributeSet = AttributeSet.empty + + override def sourceClass: Class[_ <: DataSourceV2] = source.getClass override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation] + def reader: DataSourceReader = existingReader.getOrElse { +(source, userSpecifiedSchema) match { + case (ds: ReadSupportWithSchema, Some(schema)) => +ds.createReader(schema, options) + + case (ds: ReadSupport, None) => +ds.createReader(options) + + case (ds: ReadSupport, Some(schema)) => +val reader = ds.createReader(options) +// Sanity check, this should be guaranteed by `DataFrameReader.load` +assert(reader.readSchema() == schema) +reader + + case _ => throw new IllegalStateException() +} + } + --- End diff -- Do we need to override a `def doCanonicalize`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164954926 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,16 +17,57 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader An mutable reader carrying some temporary stats during optimization and + * planning. It's always None before optimization, and does not take part in + * the equality of this plan, which means this plan is still immutable. + */ case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder { +output: Seq[AttributeReference], +source: DataSourceV2, +options: DataSourceOptions, +userSpecifiedSchema: Option[StructType], +filters: Set[Expression], +existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan { --- End diff -- Could you add a test for self join? Just to ensure it still works. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20279 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20279 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86853/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86863/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20279 **[Test build #86853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86853/testReport)** for PR 20279 at commit [`9f6c6b9`](https://github.com/apache/spark/commit/9f6c6b9ec5f9669a9147ce61ada90977e255ec85). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20448 **[Test build #86863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86863/testReport)** for PR 20448 at commit [`d96a48f`](https://github.com/apache/spark/commit/d96a48ff8f03daff23633b700a1374c89f3675e2). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164954221 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,16 +17,57 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader An mutable reader carrying some temporary stats during optimization and --- End diff -- `An` -> `A` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164953600 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2QueryPlan.scala --- @@ -19,50 +19,31 @@ package org.apache.spark.sql.execution.datasources.v2 import java.util.Objects -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression} +import org.apache.spark.sql.sources.v2.DataSourceV2 /** - * A base class for data source reader holder with customized equals/hashCode methods. + * A base class for data source v2 related query plan. It defines the equals/hashCode methods + * according to some common information. --- End diff -- We might need to emphasize this is for both physical and logical plans. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164954392 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,16 +17,57 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader An mutable reader carrying some temporary stats during optimization and + * planning. It's always None before optimization, and does not take part in + * the equality of this plan, which means this plan is still immutable. + */ case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder { +output: Seq[AttributeReference], +source: DataSourceV2, +options: DataSourceOptions, +userSpecifiedSchema: Option[StructType], +filters: Set[Expression], +existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan { --- End diff -- Why this plan does not extend `MultiInstanceRelation`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20448 **[Test build #86863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86863/testReport)** for PR 20448 at commit [`d96a48f`](https://github.com/apache/spark/commit/d96a48ff8f03daff23633b700a1374c89f3675e2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/417/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20279 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86854/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20279 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20279: [SPARK-23092][SQL] Migrate MemoryStream to DataSourceV2 ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20279 **[Test build #86854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86854/testReport)** for PR 20279 at commit [`a81c2ec`](https://github.com/apache/spark/commit/a81c2ecdafd54a2c5bfb07c6f1f53546eaa96c7c). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20435: [SPARK-23268][SQL]Reorganize packages in data sou...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20435#discussion_r164953225 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala --- @@ -20,14 +20,16 @@ package org.apache.spark.sql.kafka010 import org.apache.kafka.common.TopicPartition import org.apache.spark.sql.execution.streaming.{Offset, SerializedOffset} -import org.apache.spark.sql.sources.v2.streaming.reader.{Offset => OffsetV2, PartitionOffset} +import org.apache.spark.sql.sources.v2.reader.streaming +import org.apache.spark.sql.sources.v2.reader.streaming.PartitionOffset /** * An [[Offset]] for the [[KafkaSource]]. This one tracks all partitions of subscribed topics and --- End diff -- Should this `Offset` be `streaming.Offset`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20435: [SPARK-23268][SQL]Reorganize packages in data sou...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20435#discussion_r164953380 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala --- @@ -30,9 +30,8 @@ import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} import org.apache.spark.sql.catalyst.streaming.InternalOutputModes.{Append, Complete, Update} import org.apache.spark.sql.execution.streaming.Sink import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2} -import org.apache.spark.sql.sources.v2.streaming.StreamWriteSupport -import org.apache.spark.sql.sources.v2.streaming.writer.StreamWriter -import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.sources.v2.writer.{StreamWriteSupport, _} --- End diff -- `import org.apache.spark.sql.sources.v2.writer._`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86862/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20448 **[Test build #86862 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86862/testReport)** for PR 20448 at commit [`0665282`](https://github.com/apache/spark/commit/0665282cca2b690b2bc54e179496e03f7d77596c). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86861/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20448 **[Test build #86861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86861/testReport)** for PR 20448 at commit [`7441334`](https://github.com/apache/spark/commit/7441334e210944e1419be4059d12c06385c586cb). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r164952844 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2QueryPlan.scala --- @@ -19,50 +19,31 @@ package org.apache.spark.sql.execution.datasources.v2 import java.util.Objects -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression} +import org.apache.spark.sql.sources.v2.DataSourceV2 /** - * A base class for data source reader holder with customized equals/hashCode methods. + * A base class for data source v2 related query plan. It defines the equals/hashCode methods + * according to some common information. */ -trait DataSourceReaderHolder { +trait DataSourceV2QueryPlan { - /** - * The full output of the data source reader, without column pruning. - */ - def fullOutput: Seq[AttributeReference] + def output: Seq[Attribute] + def sourceClass: Class[_ <: DataSourceV2] + def filters: Set[Expression] - /** - * The held data source reader. - */ - def reader: DataSourceReader - - /** - * The metadata of this data source reader that can be used for equality test. - */ - private def metadata: Seq[Any] = { -val filters: Any = reader match { - case s: SupportsPushDownCatalystFilters => s.pushedCatalystFilters().toSet - case s: SupportsPushDownFilters => s.pushedFilters().toSet - case _ => Nil -} -Seq(fullOutput, reader.getClass, reader.readSchema(), filters) - } + // The metadata of this data source relation that can be used for equality test. + private def metadata: Seq[Any] = Seq(output, sourceClass, filters) def canEqual(other: Any): Boolean override def equals(other: Any): Boolean = other match { -case other: DataSourceReaderHolder => - canEqual(other) && metadata.length == other.metadata.length && -metadata.zip(other.metadata).forall { case (l, r) => l == r } +case other: DataSourceV2QueryPlan => + canEqual(other) && metadata == other.metadata case _ => false } override def hashCode(): Int = { metadata.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b) } - - lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name => --- End diff -- We don't need to do this anymore. Now the plan is immutable, we have to create a new plan when applying push down optimizations, and we can also update `output` at that time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20448 **[Test build #86862 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86862/testReport)** for PR 20448 at commit [`0665282`](https://github.com/apache/spark/commit/0665282cca2b690b2bc54e179496e03f7d77596c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20448 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/416/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20448 **[Test build #86861 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86861/testReport)** for PR 20448 at commit [`7441334`](https://github.com/apache/spark/commit/7441334e210944e1419be4059d12c06385c586cb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20448 cc @rdblue @tdas @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20448 [SPARK-23203][SQL] make DataSourceV2Relation immutable ## What changes were proposed in this pull request? This is inspired by https://github.com/apache/spark/pull/20387, but only focus on making the plan immutable. The idea is simple, instead of keeping the mutable `DataSourceReader` in the plan, we should keep `DataSourceV2`, and create the reader when needed. The pushdown information will be stored in the plan, instead of relying on the mutable reader. This can also help us removing 2 unnecessary APIs from `SupportsPushDownCatalystFilters` and `SupportsPushDownFilters`. ## How was this patch tested? I improved the test in `DataSourceV2Suite`, to make sure this new change doesn't break the column pruning and filter push down. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark immutable-plan Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20448.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20448 commit 7441334e210944e1419be4059d12c06385c586cb Author: Wenchen FanDate: 2018-01-31T03:12:00Z make DataSourceV2Relation immutable --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20419 @gatorsmile Not directly. The `CodeAndComment` case class is just a "container", it doesn't handle what gets into the `body` field. When we force embed a comment, it'll leave a comment as a placeholder in the generated code, which is in the `body` (the actual comment contents are in the `comment` map). I was just curious if any reviews knows off the top of their heads whether or not this placeholder may affect equality. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId in comme...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20419 @rednaxelafx Does the following [codes](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1286-L1294) address your concern? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId i...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20419#discussion_r164951129 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1227,12 +1227,13 @@ class CodegenContext { /** * Register a comment and return the corresponding place holder */ - def registerComment(text: => String): String = { + def registerComment(text: => String, forceComment: Boolean = false): String = { --- End diff -- Nit: rename it to `force`? Also adding the text? ``` @param force whether to force registering the comments ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20400 LGTM too except the three comments above. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164950417 --- Diff: python/pyspark/sql/functions.py --- @@ -809,6 +809,45 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(2.3) +def unboundedPreceding(): +""" +Window function: returns the special frame boundary that represents the first row +in the window partition. +>>> df = spark.createDataFrame([(5,)]) +>>> df.select(unboundedPreceding()).show + --- End diff -- Seems this print out the function `show`. Is this intentional? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164950531 --- Diff: python/pyspark/sql/functions.py --- @@ -809,6 +809,45 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(2.3) +def unboundedPreceding(): +""" +Window function: returns the special frame boundary that represents the first row +in the window partition. +>>> df = spark.createDataFrame([(5,)]) +>>> df.select(unboundedPreceding()).show + +""" +sc = SparkContext._active_spark_context +return Column(sc._jvm.functions.unboundedPreceding()) + + +@since(2.3) +def unboundedFollowing(): +""" +Window function: returns the special frame boundary that represents the last row +in the window partition. +>>> df = spark.createDataFrame([(5,)]) --- End diff -- I believe we didn't claim we follow PEP 257 yet but I believe it would be good to have a newline between doctest and the description at least, if you don't mind. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org