[GitHub] spark pull request #20306: [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20306#discussion_r162269190 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -838,6 +839,7 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String |$evPrim = $buffer.build(); """.stripMargin } + case pudt: PythonUserDefinedType => castToStringCode(pudt.sqlType, ctx) --- End diff -- So, when casting any UDT to string, use `sqlType` casting. And when `show()`ing a Java UDT, use deserialization and the deserialized object's `toString()`; otherwise use casting to string. (I guess we shouldn't use the deserialized object's `toString()` for a Python-only UDT, which shows the raw string I mentioned in the description.) Is that right?
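[Editor's note] For readers following the thread, a minimal Scala sketch of the two strategies under discussion. The helper name and shape are illustrative (the real logic lives in `Cast.castToStringCode` and generates Java code); only `UserDefinedType.sqlType` and `UserDefinedType.deserialize` are actual APIs.

```scala
import org.apache.spark.sql.types.UserDefinedType

// Strategy 1 (this PR, for Python-only UDTs): ignore the UDT wrapper and cast
// the underlying catalyst value according to udt.sqlType, reusing the ordinary
// cast-to-string code for ArrayType, StructType, etc.

// Strategy 2 (what show() relies on for Java/Scala UDTs): deserialize the
// catalyst value back into the user's object and call its toString.
def udtToString(udt: UserDefinedType[_], catalystValue: Any): String =
  udt.asInstanceOf[UserDefinedType[Any]].deserialize(catalystValue).toString

// A PythonUserDefinedType has no JVM-side deserializer, so strategy 2 would
// print the raw internal value (e.g. "...UnsafeArrayData@..."), which is why
// the patch falls back to strategy 1 there.
```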
[GitHub] spark issue #20302: [SPARK-23094] Fix invalid character handling in JsonData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20302 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86312/ Test PASSed.
[GitHub] spark issue #20302: [SPARK-23094] Fix invalid character handling in JsonData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20302 Merged build finished. Test PASSed.
[GitHub] spark pull request #20310: revert [SPARK-10030] Use tags to control which te...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/20310#discussion_r162268244 --- Diff: common/tags/src/test/java/org/apache/spark/tags/DockerTest.java --- @@ -1,26 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.tags; - -import java.lang.annotation.*; -import org.scalatest.TagAnnotation; - -@TagAnnotation -@Retention(RetentionPolicy.RUNTIME) -@Target({ElementType.METHOD, ElementType.TYPE}) -public @interface DockerTest { } --- End diff -- If you search through the commit history, I'm pretty sure that we originally tried running those DockerTests on RISELAB Jenkins but ran into problems with the Docker Daemon becoming unstable under heavy build load. This should be fixed in the newer-generation Ubuntu build workers, but we haven't quite finished migrating the PRBs onto those. Given this, my hunch is that those tests aren't running anywhere right now, which isn't great. I think they're primarily used for testing JDBC data source SQL dialect mappings. It's been a year or more since I last looked into this, though, so I might be misremembering.
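[Editor's note] For context, a minimal sketch of how a ScalaTest tag annotation like the one being deleted gets used. The suite name and package are hypothetical; the annotation and the `-l` exclusion flag are standard ScalaTest.

```scala
package org.apache.spark.sql.jdbc // hypothetical location

import org.scalatest.FunSuite
import org.apache.spark.tags.DockerTest

// Tagging the whole suite lets the test runner include or exclude it
// wholesale; Spark's JDBC docker integration suites are annotated like this.
@DockerTest
class MyJdbcDockerIntegrationSuite extends FunSuite {
  test("dialect mapping against a real database") {
    // would talk to a database started inside a Docker container
  }
}

// ScalaTest can then exclude every tagged suite, e.g. from sbt:
//   testOnly org.apache.spark.sql.jdbc.* -- -l org.apache.spark.tags.DockerTest
```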
[GitHub] spark issue #20302: [SPARK-23094] Fix invalid character handling in JsonData...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20302 **[Test build #86312 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86312/testReport)** for PR 20302 at commit [`1d998fb`](https://github.com/apache/spark/commit/1d998fb909e6809c0b2b49500324a13dd8008eca). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20310 Merged build finished. Test FAILed.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20310 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86325/ Test FAILed.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20310 **[Test build #86325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86325/testReport)** for PR 20310 at commit [`ac77bb4`](https://github.com/apache/spark/commit/ac77bb4dee17aec2ed7afd2138115947a8eef923). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20310: revert [SPARK-10030] Use tags to control which te...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20310#discussion_r162266565 --- Diff: common/tags/src/test/java/org/apache/spark/tags/DockerTest.java --- @@ -1,26 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.tags; - -import java.lang.annotation.*; -import org.scalatest.TagAnnotation; - -@TagAnnotation -@Retention(RetentionPolicy.RUNTIME) -@Target({ElementType.METHOD, ElementType.TYPE}) -public @interface DockerTest { } --- End diff -- If it is, we should preserve this behavior.
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20309 Merged build finished. Test FAILed.
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20309 **[Test build #86323 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86323/testReport)** for PR 20309 at commit [`d087302`](https://github.com/apache/spark/commit/d08730279e4aa55b2691133584fe05c04fbfd7d3). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20309 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86323/ Test FAILed.
[GitHub] spark pull request #20306: [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20306#discussion_r162265844 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -838,6 +839,7 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String |$evPrim = $buffer.build(); """.stripMargin } + case pudt: PythonUserDefinedType => castToStringCode(pudt.sqlType, ctx) --- End diff -- The Java UDT is also internal; shall we make it consistent with the Python UDT?
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20307 Merged build finished. Test PASSed.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20310 **[Test build #86325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86325/testReport)** for PR 20310 at commit [`ac77bb4`](https://github.com/apache/spark/commit/ac77bb4dee17aec2ed7afd2138115947a8eef923).
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20307 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86322/ Test PASSed.
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20307 **[Test build #86322 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86322/testReport)** for PR 20307 at commit [`d41709f`](https://github.com/apache/spark/commit/d41709fc33640b3015b07f308da813b2becdb091). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20310: revert [SPARK-10030] Use tags to control which te...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20310#discussion_r162265397 --- Diff: common/tags/src/test/java/org/apache/spark/tags/DockerTest.java --- @@ -1,26 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.tags; - -import java.lang.annotation.*; -import org.scalatest.TagAnnotation; - -@TagAnnotation -@Retention(RetentionPolicy.RUNTIME) -@Target({ElementType.METHOD, ElementType.TYPE}) -public @interface DockerTest { } --- End diff -- We don't use this docker tag in `modules.py`; does that mean we never run docker tests in the PR builder?
[GitHub] spark pull request #20306: [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20306#discussion_r162265243 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -838,6 +839,7 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String |$evPrim = $buffer.build(); """.stripMargin } + case pudt: PythonUserDefinedType => castToStringCode(pudt.sqlType, ctx) --- End diff -- As a note, I think `UserDefinedType` in Python is for internal usage: https://github.com/apache/spark/blob/3e40eb3f1ffac3d2f49459a801e3ce171ed34091/python/pyspark/sql/types.py#L642 I think we are fine as long as it shows a more correct string form of the UDTs defined within our projects.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20310 We want a way to test yarn for the new fix: https://github.com/apache/spark/pull/20297 I checked `run-tests-jenkins.py` and it seems we only support `test-maven`, `test-hadoop2.6`, and `test-hadoop2.7`. It would be good if we had a `test-yarn` tag; then we wouldn't need this patch.
[GitHub] spark issue #20276: [SPARK-14948][SQL] disambiguate attributes in join condi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20276 **[Test build #86324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86324/testReport)** for PR 20276 at commit [`d0bdddf`](https://github.com/apache/spark/commit/d0bdddfffc18258ba1536c9cff4ea0856026094c).
[GitHub] spark pull request #20306: [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/20306#discussion_r162263815 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -838,6 +839,7 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String |$evPrim = $buffer.build(); """.stripMargin } + case pudt: PythonUserDefinedType => castToStringCode(pudt.sqlType, ctx) --- End diff -- The second one (deserializing the catalyst value to the customer object and calling `toString`). See: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L286
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20309 **[Test build #86323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86323/testReport)** for PR 20309 at commit [`d087302`](https://github.com/apache/spark/commit/d08730279e4aa55b2691133584fe05c04fbfd7d3).
[GitHub] spark issue #20305: [SPARK-23140][SQL] Add DataSourceV2Strategy to Hive Sess...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20305 Merged build finished. Test FAILed.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20310 Merged build finished. Test FAILed.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/20310 Are you sure that we want to blanket revert this entire patch? Is there a more surgical short-term fix we can make in `dev/sparktestsupport/modules.py` to just always unconditionally enable the tag for now? Also, is this the first time recently that we've failed the YARN integration tests? How much time do they add? The trade-off here seems to be between slightly slower after-the-fact detection of a test failure / build break due to YARN vs. faster tests for the majority of PRs that don't touch YARN code. I think we've had one or two such breaks in the 2+ years that we've been using these test tags, so I'd also be fine with postponing this change if you agree that it's unlikely that we're going to have many such failures here. If the motivation is that it's hard to test the fix for such build breaks (because the failing test wouldn't be exercised in the PR builder), then I think we might already have a solution via special tags placed into the PR title (I think `test-yarn` or something similar; see `run-tests-jenkins.py`).
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20308 @zsxwing @jose-torres please take a look.
[GitHub] spark pull request #20276: [SPARK-14948][SQL] disambiguate attributes in joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20276#discussion_r162262776 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -318,7 +318,10 @@ class Analyzer( gid: Expression): Expression = { expr transform { case e: GroupingID => - if (e.groupByExprs.isEmpty || e.groupByExprs == groupByExprs) { + def sameExpressions(e1: Seq[Expression], e2: Seq[Expression]): Boolean = { --- End diff -- Anyway, my PR exposed this bug, as `Dataset.col` now returns a slightly different attribute with metadata.
[GitHub] spark pull request #20276: [SPARK-14948][SQL] disambiguate attributes in joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20276#discussion_r162262684 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -318,7 +318,10 @@ class Analyzer( gid: Expression): Expression = { expr transform { case e: GroupingID => - if (e.groupByExprs.isEmpty || e.groupByExprs == groupByExprs) { + def sameExpressions(e1: Seq[Expression], e2: Seq[Expression]): Boolean = { --- End diff -- I think so; basically, we should always use `semanticEquals` when matching expressions.
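[Editor's note] A minimal illustration of the difference being discussed, using Catalyst internals (so it only compiles against spark-catalyst); the metadata key is made up for the example.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{IntegerType, Metadata}

// Two references to the same underlying attribute (same ExprId) that differ
// only in metadata, e.g. before and after Dataset.col attaches its tag.
val a = AttributeReference("x", IntegerType)()
val b = AttributeReference("x", IntegerType, nullable = true,
  metadata = Metadata.fromJson("""{"tag": 1}"""))(exprId = a.exprId)

a == b              // false: structural equality includes metadata
a semanticEquals b  // true: comparison on canonicalized forms ignores
                    // cosmetic differences like names and metadata
```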
[GitHub] spark issue #20305: [SPARK-23140][SQL] Add DataSourceV2Strategy to Hive Sess...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20305 **[Test build #86313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86313/testReport)** for PR 20305 at commit [`b721eb6`](https://github.com/apache/spark/commit/b721eb62baceb778636b6d243a0e4da77e89b894). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20308 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86320/ Test PASSed.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20310 **[Test build #86321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86321/testReport)** for PR 20310 at commit [`47687bb`](https://github.com/apache/spark/commit/47687bb37b9aaeac7e0305c0eaa7f419255e1a45). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20310 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86321/ Test FAILed.
[GitHub] spark issue #20305: [SPARK-23140][SQL] Add DataSourceV2Strategy to Hive Sess...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20305 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86313/ Test FAILed.
[GitHub] spark pull request #20306: [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20306#discussion_r162262394 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -838,6 +839,7 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String |$evPrim = $buffer.build(); """.stripMargin } + case pudt: PythonUserDefinedType => castToStringCode(pudt.sqlType, ctx) --- End diff -- This reminds me of one thing: how shall we cast a UDT to string? Shall we just strip the UDT and cast the data as if it were a catalyst value? Or deserialize the catalyst value to the customer object and call `toString`? Note that `Dataset.show` should definitely call the customer object's `toString`, but I think cast is different, as it's a SQL operation.
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20308 Merged build finished. Test PASSed.
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20308 **[Test build #86320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86320/testReport)** for PR 20308 at commit [`bc13ec4`](https://github.com/apache/spark/commit/bc13ec497d9ba88ce69493a5a026bd91db3ebb6f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20307 **[Test build #86322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86322/testReport)** for PR 20307 at commit [`d41709f`](https://github.com/apache/spark/commit/d41709fc33640b3015b07f308da813b2becdb091).
[GitHub] spark issue #20305: [SPARK-23140][SQL] Add DataSourceV2Strategy to Hive Sess...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20305 LGTM
[GitHub] spark pull request #20255: [SPARK-23064][DOCS][SS] Added documentation for s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20255#discussion_r162261740 --- Diff: docs/structured-streaming-programming-guide.md --- @@ -1089,6 +1098,224 @@ streamingDf.join(staticDf, "type", "right_join") # right outer join with a stat +Note that stream-static joins are not stateful, so no state management is necessary. +However, a few types of stream-static outer join are not supported as the incomplete view of +all data in a stream makes it infeasible to calculate the results correctly. +These are discussed at the end of this section. + + Stream-stream Joins +In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming +Datasets/DataFrames. The challenge of generating join results between two data streams is that, +at any point of time, the view of the dataset is incomplete for both sides of the join making +it much harder to find matches between inputs. Any row received from one input stream can match +with any future, yet-to-be-received row from the other input stream. Hence, for both the input +streams, we buffer past input as streaming state, so that we can match every future input with +past input and accordingly generate joined results. Furthermore, similar to streaming aggregations, +we automatically handle late, out-of-order data and can limit the state using watermarks. +Let's discuss the different types of supported stream-stream joins and how to use them. + +# Inner Joins with optional Watermarking +Inner joins on any kind of columns along with any kind of join conditions are supported. +However, as the stream runs, the size of streaming state will keep growing indefinitely as +*all* past input must be saved as any new input can match with any input from the past. +To avoid unbounded state, you have to define additional join conditions such that indefinitely +old inputs cannot match with future inputs and therefore can be cleared from the state. +In other words, you will have to do the following additional steps in the join. + +1. Define watermark delays on both inputs such that the engine knows how delayed the input can be +(similar to streaming aggregations) + +1. Define a constraint on event-time across the two inputs such that the engine can figure out when +old rows of one input are not going to be required for matches with the other input. This constraint +can either be a time range condition (e.g. `...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR`), +or equi-join on event-time windows (e.g. `...JOIN ON leftTimeWindow = rightTimeWindow`). +Let's understand this with an example. + +Let's say we want to join a stream of advertisement impressions (when an ad was shown) with +another stream of user clicks on advertisements to correlate when impressions led to +monetizable clicks. To allow the state cleanup in this stream-stream join, you will have to +specify the watermarking delays and the time constraints as follows. + +1. Watermark delays: Say, the impressions and the corresponding clicks can be late/out-of-order +in event-time by at most 2 and 3 hours, respectively. + +1. Event-time range condition: Say, a click can occur within a time range of 0 seconds to 1 hour +after the corresponding impression. + +The code would look like this. + + + + +{% highlight scala %} +import org.apache.spark.sql.functions.expr + +val impressions = spark.readStream. ... +val clicks = spark.readStream. ...
+ +// Apply watermarks on event-time columns +val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours") +val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours") + +// Join with event-time constraints +impressionsWithWatermark.join( + clicksWithWatermark, + expr(""" +clickAdId = impressionAdId AND +clickTime >= impressionTime AND +clickTime <= impressionTime + interval 1 hour +""" + )) + +{% endhighlight %} + + + + +{% highlight java %} +import static org.apache.spark.sql.functions.expr + +Dataset<Row> impressions = spark.readStream(). ... +Dataset<Row> clicks = spark.readStream(). ... + +// Apply watermarks on event-time columns +Dataset<Row> impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours"); +Dataset<Row> clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours"); + +// Join with event-time constraints +impressionsWithWatermark.join( + clicksWithWatermark, + expr( +"clickAdId = impressionAdId AND " + +"clickTime >= impressionTime AND " + +"clickTime <=
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20307 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86317/ Test PASSed.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20310 **[Test build #86321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86321/testReport)** for PR 20310 at commit [`47687bb`](https://github.com/apache/spark/commit/47687bb37b9aaeac7e0305c0eaa7f419255e1a45).
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20307 Merged build finished. Test PASSed.
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20307 **[Test build #86317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86317/testReport)** for PR 20307 at commit [`1a2c01d`](https://github.com/apache/spark/commit/1a2c01d84315e8937f0683680dd81dec5a4a3a6f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20310: revert [SPARK-10030] Use tags to control which tests to ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20310 cc @vanzin @sameeragarwal
[GitHub] spark pull request #20310: revert [SPARK-10030] Use tags to control which te...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20310 revert [SPARK-10030] Use tags to control which tests to run depending on changes being tested ## What changes were proposed in this pull request? This PR reverts https://github.com/apache/spark/pull/8775 The problem is that, when we change the code in Spark core, we may break the yarn tests, so we should run the yarn tests even if we didn't change code in the yarn module. We broke the build because of this issue in https://github.com/apache/spark/pull/20223 ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20310.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20310 commit 47687bb37b9aaeac7e0305c0eaa7f419255e1a45 Author: Wenchen Fan Date: 2018-01-18T06:42:57Z rever [SPARK-10030] Use tags to control which tests to run depending on changes being tested
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20308 **[Test build #86320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86320/testReport)** for PR 20308 at commit [`bc13ec4`](https://github.com/apache/spark/commit/bc13ec497d9ba88ce69493a5a026bd91db3ebb6f).
[GitHub] spark pull request #20307: [SPARK-23141][SQL][PYSPARK] Support data type str...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20307#discussion_r162260542 --- Diff: python/pyspark/sql/udf.py --- @@ -310,14 +310,22 @@ def registerJavaFunction(self, name, javaClassName, returnType=None): ... "javaStringLength", "test.org.apache.spark.sql.JavaStringLength", IntegerType()) >>> spark.sql("SELECT javaStringLength('test')").collect() [Row(UDF:javaStringLength(test)=4)] + >>> spark.udf.registerJavaFunction( ... "javaStringLength2", "test.org.apache.spark.sql.JavaStringLength") >>> spark.sql("SELECT javaStringLength2('test')").collect() [Row(UDF:javaStringLength2(test)=4)] + +>>> spark.udf.registerJavaFunction( +... "javaStringLength3", "test.org.apache.spark.sql.JavaStringLength", "integer") +>>> spark.sql("SELECT javaStringLength3('test')").collect() +[Row(UDF:javaStringLength3(test)=4)] """ jdt = None if returnType is not None: +if not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) --- End diff -- Yup, that's https://github.com/apache/spark/pull/20307#discussion_r162258962 :).
[GitHub] spark pull request #20307: [SPARK-23141][SQL][PYSPARK] Support data type str...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20307#discussion_r162260194 --- Diff: python/pyspark/sql/udf.py --- @@ -310,14 +310,22 @@ def registerJavaFunction(self, name, javaClassName, returnType=None): ... "javaStringLength", "test.org.apache.spark.sql.JavaStringLength", IntegerType()) >>> spark.sql("SELECT javaStringLength('test')").collect() [Row(UDF:javaStringLength(test)=4)] + >>> spark.udf.registerJavaFunction( ... "javaStringLength2", "test.org.apache.spark.sql.JavaStringLength") >>> spark.sql("SELECT javaStringLength2('test')").collect() [Row(UDF:javaStringLength2(test)=4)] + +>>> spark.udf.registerJavaFunction( +... "javaStringLength3", "test.org.apache.spark.sql.JavaStringLength", "integer") +>>> spark.sql("SELECT javaStringLength3('test')").collect() +[Row(UDF:javaStringLength3(test)=4)] """ jdt = None if returnType is not None: +if not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) --- End diff -- The param doc needs to be modified too.
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20309 Merged build finished. Test FAILed.
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20309 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86319/ Test FAILed.
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20309 **[Test build #86319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86319/testReport)** for PR 20309 at commit [`d0eaabe`](https://github.com/apache/spark/commit/d0eaabecb95c32d7eed12e213685aa44b68b352e). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20309 **[Test build #86319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86319/testReport)** for PR 20309 at commit [`d0eaabe`](https://github.com/apache/spark/commit/d0eaabecb95c32d7eed12e213685aa44b68b352e).
[GitHub] spark pull request #20307: [SPARK-23141][SQL][PYSPARK] Support data type str...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20307#discussion_r162259997 --- Diff: python/pyspark/sql/udf.py --- @@ -310,14 +310,22 @@ def registerJavaFunction(self, name, javaClassName, returnType=None): ... "javaStringLength", "test.org.apache.spark.sql.JavaStringLength", IntegerType()) --- End diff -- Sure, I'll update them here soon.
[GitHub] spark issue #20309: [SPARK-23143][SS][PYTHON] Added python API for setting c...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20309 @zsxwing @jose-torres please take a look.
[GitHub] spark pull request #20276: [SPARK-14948][SQL] disambiguate attributes in joi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20276#discussion_r162259637 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -730,12 +733,28 @@ class Analyzer( right case Some((oldRelation, newRelation)) => val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) + // If we de-duplicated an `AnalysisBarrier`, then we should only replace + // `AttributeReference` that refers to this `AnalysisBarrier`. + val barrierId = oldRelation match { +case b: AnalysisBarrier => Some(b.id) +case _ => None + } right transformUp { case r if r == oldRelation => newRelation } transformUp { case other => other transformExpressions { - case a: Attribute => -dedupAttr(a, attributeRewrites) + case a: AttributeReference => +// Only replace `AttributeReference` when the de-duplicated relation is not a +// `AnalysisBarrier`, or this `AttributeReference` is not associated with any +// `AnalysisBarrier`, or this `AttributeReference` refers to the de-duplicated +// `AnalysisBarrier`, i.e. barrierId matches. +if (barrierId.isEmpty || !a.metadata.contains(AnalysisBarrier.metadataKey) || + barrierId.get == a.metadata.getLong(AnalysisBarrier.metadataKey)) { + dedupAttr(a, attributeRewrites) --- End diff -- Looks like it is the same as: ```scala // When we de-duplicated an `AnalysisBarrier` and this `AttributeReference` is associated with other // `AnalysisBarrier` different than the de-duplicated one, we don't replace it. val notToReplace = barrierId.map { id => a.metadata.contains(AnalysisBarrier.metadataKey) && id != a.metadata.getLong(AnalysisBarrier.metadataKey) }.getOrElse(false) if (notToReplace) { a } else { dedupAttr(a, attributeRewrites) } ```
[GitHub] spark pull request #20243: [SPARK-23052][SS] Migrate ConsoleSink to data sou...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20243
[GitHub] spark pull request #20276: [SPARK-14948][SQL] disambiguate attributes in joi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20276#discussion_r162257509 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -318,7 +318,10 @@ class Analyzer( gid: Expression): Expression = { expr transform { case e: GroupingID => - if (e.groupByExprs.isEmpty || e.groupByExprs == groupByExprs) { + def sameExpressions(e1: Seq[Expression], e2: Seq[Expression]): Boolean = { --- End diff -- Is this a bug not related to this PR?
[GitHub] spark pull request #20309: [SPARK-23143][SS][PYTHON] Added python API for se...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/20309 [SPARK-23143][SS][PYTHON] Added python API for setting continuous trigger ## What changes were proposed in this pull request? Self-explanatory. ## How was this patch tested? New python tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tdas/spark SPARK-23143 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20309.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20309 commit d0eaabecb95c32d7eed12e213685aa44b68b352e Author: Tathagata Das Date: 2018-01-18T06:37:47Z added python apis for continuous trigger
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20308 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86316/ Test PASSed.
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20308 Merged build finished. Test PASSed.
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20308 **[Test build #86316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86316/testReport)** for PR 20308 at commit [`43f2399`](https://github.com/apache/spark/commit/43f239946140268ef31319b1dbbd182846ab73d5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20307: [SPARK-23141][SQL][PYSPARK] Support data type str...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20307#discussion_r162258962 --- Diff: python/pyspark/sql/udf.py --- @@ -310,14 +310,22 @@ def registerJavaFunction(self, name, javaClassName, returnType=None): ... "javaStringLength", "test.org.apache.spark.sql.JavaStringLength", IntegerType()) --- End diff -- Ah, it seems we need to fix `:param returnType:` across all the other related APIs to say it takes a DDL-formatted type string. @ueshin, mind opening a minor PR for this - `udf`, `pandas_udf`, `registerJavaFunction` and `register` - separately? If you are busy, I will do it tonight. Doing this here is fine to me too, up to you.
[GitHub] spark issue #20305: [SPARK-23140][SQL] Add DataSourceV2Strategy to Hive Sess...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20305 **[Test build #86318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86318/testReport)** for PR 20305 at commit [`f17b44d`](https://github.com/apache/spark/commit/f17b44de6e4d2ece008d3856fdcc037cce7dd147).
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20307 cc @HyukjinKwon
[GitHub] spark issue #20308: [SPARK-23142][SS][DOCS] Added docs for continuous proces...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20308 **[Test build #86316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86316/testReport)** for PR 20308 at commit [`43f2399`](https://github.com/apache/spark/commit/43f239946140268ef31319b1dbbd182846ab73d5).
[GitHub] spark issue #20307: [SPARK-23141][SQL][PYSPARK] Support data type string as ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20307 **[Test build #86317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86317/testReport)** for PR 20307 at commit [`1a2c01d`](https://github.com/apache/spark/commit/1a2c01d84315e8937f0683680dd81dec5a4a3a6f).
[GitHub] spark pull request #20308: [SPARK-23142][SS][DOCS] Added docs for continuous...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/20308 [SPARK-23142][SS][DOCS] Added docs for continuous processing ## What changes were proposed in this pull request? Added documentation for continuous processing. Modified two locations. - Modified the overview to have a mention of Continuous Processing. - Added a new section on Continuous Processing at the end. ![image](https://user-images.githubusercontent.com/663212/35083551-a3dd23f6-fbd4-11e7-9e7e-90866f131ca9.png) ![image](https://user-images.githubusercontent.com/663212/35083559-aa08ffb6-fbd4-11e7-8df1-19887fd7e8e4.png) ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/tdas/spark SPARK-23142 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20308.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20308 commit 43f239946140268ef31319b1dbbd182846ab73d5 Author: Tathagata Das Date: 2018-01-18T06:17:34Z Added docs for continuous processing
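[Editor's note] For readers unfamiliar with the feature being documented: continuous processing is opted into per query via a continuous trigger. A minimal Scala sketch, with illustrative source/sink choices (the Python API added in the related PR mirrors this):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("continuous-demo").getOrCreate()

// Read from a source that supports continuous execution (e.g. the rate source).
val df = spark.readStream
  .format("rate")
  .load()

// Trigger.Continuous switches the query from micro-batch execution to the
// continuous engine; the interval is the checkpoint interval, not a batch size.
val query = df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```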
[GitHub] spark pull request #20307: [SPARK-23141][SQL][PYSPARK] Support data type str...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/20307 [SPARK-23141][SQL][PYSPARK] Support data type string as a returnType for registerJavaFunction. ## What changes were proposed in this pull request? Currently `UDFRegistration.registerJavaFunction` doesn't support a data type string as a `returnType`, whereas `UDFRegistration.register`, `@udf`, or `@pandas_udf` does. We can support it for `UDFRegistration.registerJavaFunction` as well. ## How was this patch tested? Added a doctest and existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-23141 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20307.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20307 commit 1a2c01d84315e8937f0683680dd81dec5a4a3a6f Author: Takuya UESHIN Date: 2018-01-18T06:04:30Z Support data type string as a returnType for registerJavaFunction.
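[Editor's note] The Scala side already accepts both forms; a minimal sketch of the equivalent parsing step. The UDF class name is the test fixture referenced above; `DataType.fromDDL` is assumed available (newer Spark versions), and the Python change wires `_parse_datatype_string` into the same path.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DataType

val spark = SparkSession.builder().getOrCreate()

// A DDL-formatted type string parses to the same DataType an explicit
// IntegerType would give, so "integer" and IntegerType are interchangeable.
val returnType: DataType = DataType.fromDDL("integer")

// Register a Java UDF by class name with an explicit return type; this
// mirrors what registerJavaFunction does in PySpark after the fix.
spark.udf.registerJava(
  "javaStringLength", "test.org.apache.spark.sql.JavaStringLength", returnType)
```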
[GitHub] spark issue #20298: [SPARK-22976][Core]: Cluster mode driver dir removed whi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20298 **[Test build #86315 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86315/testReport)** for PR 20298 at commit [`38916f7`](https://github.com/apache/spark/commit/38916f769252938fbce891cf1d21972e50a01181).
[GitHub] spark issue #20298: [SPARK-22976][Core]: Cluster mode driver dir removed whi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20298 Jenkins, retest this please.
[GitHub] spark issue #20305: [SPARK-23140][SQL] Add DataSourceV2Strategy to Hive Sess...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20305 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86311/ Test PASSed.
[GitHub] spark issue #20305: [SPARK-23140][SQL] Add DataSourceV2Strategy to Hive Sess...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20305 Merged build finished. Test PASSed.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86310/ Test PASSed.
[GitHub] spark issue #20305: [SPARK-23140][SQL] Add DataSourceV2Strategy to Hive Sess...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20305 **[Test build #86311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86311/testReport)** for PR 20305 at commit [`fc5d3ce`](https://github.com/apache/spark/commit/fc5d3ce47fad2e5fd3a2c0e5c94002189fffc462). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Merged build finished. Test PASSed.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20204 **[Test build #86310 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86310/testReport)** for PR 20204 at commit [`df9af13`](https://github.com/apache/spark/commit/df9af138adcf9bfef99b3f6a3fb6779a3d75fa69). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* fo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20288
[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20288 Thanks! merging to master/2.3.
[GitHub] spark issue #20306: [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType castin...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20306 cc @maropu @cloud-fan @HyukjinKwon
[GitHub] spark issue #20306: [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType castin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20306 **[Test build #86314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86314/testReport)** for PR 20306 at commit [`74c1735`](https://github.com/apache/spark/commit/74c17353bb6372b123c5aee1b6d58a21de36f99a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20288 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20288 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86309/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20306: [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/20306 [SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType casting when casting PythonUserDefinedType to String.
## What changes were proposed in this pull request?
This is a follow-up of #20246. If a UDT in Python doesn't have a corresponding Scala UDT, casting it to string yields the raw string of the internal value, e.g. `"org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@"` if the internal type is `ArrayType`. This PR fixes it by casting via the UDT's `sqlType` instead.
## How was this patch tested?
Added a test and existing tests.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-23054/fup1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20306.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20306
commit 74c17353bb6372b123c5aee1b6d58a21de36f99a Author: Takuya UESHIN Date: 2018-01-18T05:27:10Z Use sqlType casting when casting PythonUserDefinedType to String. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
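For readers following the change, here is a minimal sketch of the delegation idea the PR description relies on. It is not the actual `Cast` code, and the helper name is made up; it only illustrates why resolving a UDT down to its `sqlType` lets the cast render the underlying Catalyst value (e.g. an array) instead of the JVM object's default `toString`:

```scala
// Sketch only (would live inside Spark's own packages, since the UDT API is
// internal). Peel UDT wrappers down to the Catalyst type that actually backs
// the stored value, so string casting can be generated for e.g. ArrayType
// instead of printing "UnsafeArrayData@...".
import org.apache.spark.sql.types.{DataType, UserDefinedType}

def resolveCastType(dt: DataType): DataType = dt match {
  case udt: UserDefinedType[_] => resolveCastType(udt.sqlType) // sqlType may itself be a UDT
  case other => other
}
```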
[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20288 **[Test build #86309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86309/testReport)** for PR 20288 at commit [`e121273`](https://github.com/apache/spark/commit/e121273972d0ec0d94cc01e4426358b4e5fb7e2c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20288 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252757 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala ---
@@ -0,0 +1,321 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.Path
+import org.apache.orc.OrcConf.COMPRESS
+import org.apache.parquet.hadoop.ParquetOutputFormat
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.sql.execution.datasources.orc.OrcOptions
+import org.apache.spark.sql.execution.datasources.parquet.{ParquetOptions, ParquetTest}
+import org.apache.spark.sql.hive.orc.OrcFileOperator
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+
+class CompressionCodecSuite extends TestHiveSingleton with ParquetTest with BeforeAndAfterAll {
+  import spark.implicits._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    (0 until maxRecordNum).toDF("a").createOrReplaceTempView("table_source")
+  }
+
+  override def afterAll(): Unit = {
+    try {
+      spark.catalog.dropTempView("table_source")
+    } finally {
+      super.afterAll()
+    }
+  }
+
+  private val maxRecordNum = 500
+
+  private def getConvertMetastoreConfName(format: String): String = format.toLowerCase match {
+    case "parquet" => HiveUtils.CONVERT_METASTORE_PARQUET.key
+    case "orc" => HiveUtils.CONVERT_METASTORE_ORC.key
+  }
+
+  private def getSparkCompressionConfName(format: String): String = format.toLowerCase match {
+    case "parquet" => SQLConf.PARQUET_COMPRESSION.key
+    case "orc" => SQLConf.ORC_COMPRESSION.key
+  }
+
+  private def getHiveCompressPropName(format: String): String = format.toLowerCase match {
+    case "parquet" => ParquetOutputFormat.COMPRESSION
+    case "orc" => COMPRESS.getAttribute
+  }
+
+  private def normalizeCodecName(format: String, name: String): String = {
+    format.toLowerCase match {
+      case "parquet" => ParquetOptions.shortParquetCompressionCodecNames(name).name()
+      case "orc" => OrcOptions.shortOrcCompressionCodecNames(name)
+    }
+  }
+
+  private def getTableCompressionCodec(path: String, format: String): Seq[String] = {
+    val hadoopConf = spark.sessionState.newHadoopConf()
+    val codecs = format.toLowerCase match {
+      case "parquet" => for {
+        footer <- readAllFootersWithoutSummaryFiles(new Path(path), hadoopConf)
+        block <- footer.getParquetMetadata.getBlocks.asScala
+        column <- block.getColumns.asScala
+      } yield column.getCodec.name()
+      case "orc" => new File(path).listFiles().filter{ file =>
+        file.isFile && !file.getName.endsWith(".crc") && file.getName != "_SUCCESS"
+      }.map { orcFile =>
+        OrcFileOperator.getFileReader(orcFile.toPath.toString).get.getCompression.toString
+      }.toSeq
+    }
+    codecs.distinct
+  }
+
+  private def createTable(
+      rootDir: File,
+      tableName: String,
+      isPartitioned: Boolean,
+      format: String,
+      compressionCodec: Option[String]): Unit = {
+    val tblProperties = compressionCodec match {
+      case Some(prop) => s"TBLPROPERTIES('${getHiveCompressPropName(format)}'='$prop')"
+      case _ => ""
+    }
+    val partitionCreate = if (isPartitioned) "PARTITIONED BY (p string)" else ""
+    sql(
+      s"""
+        |CREATE TABLE $tableName(a int)
+        |$partitionCreate
+        |STORED AS $format
+        |LOCATION '${rootDir.toURI.toString.stripSuffix("/")}/$tableName'
+        |$tblProperties
+      """.stripMargin)
+  }
+
+  private def writeDataToTable(
+      tableName: String,
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162253089 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above; comment truncated)
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252636 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above; comment truncated)
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252901 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above; comment truncated)
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252598 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above; comment truncated)
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252484 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above, ending at)
+  private val maxRecordNum = 500
--- End diff -- Reduce it to 50 to decrease the execution time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
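If applied, the suggestion is a one-line change (illustrative sketch only):

```scala
// Smaller source dataset per the review suggestion; fewer rows per table
// still exercise the write path while keeping the suite fast.
private val maxRecordNum = 50
```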
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162253009 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above; comment truncated)
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162253126 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above; comment truncated)
[GitHub] spark issue #20288: [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20288 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86306/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252794 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above; comment truncated)
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252757 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
@@ -76,7 +76,7 @@ object ParquetOptions {
   val MERGE_SCHEMA = "mergeSchema"

   // The parquet compression short names
-  private val shortParquetCompressionCodecNames = Map(
+  val shortParquetCompressionCodecNames = Map(
--- End diff -- The same here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
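If the same treatment is applied here, the Parquet-side accessor might look like the sketch below. The method name is assumed by analogy with the accessor suggested for `OrcOptions` in the next comment, and the map entries are shown only for illustration; note the Parquet map holds `CompressionCodecName` enum values, hence the `.name()`:

```scala
import org.apache.parquet.hadoop.metadata.CompressionCodecName

object ParquetOptions {
  // The parquet compression short names (kept private, per the suggestion).
  private val shortParquetCompressionCodecNames = Map(
    "none" -> CompressionCodecName.UNCOMPRESSED,
    "uncompressed" -> CompressionCodecName.UNCOMPRESSED,
    "snappy" -> CompressionCodecName.SNAPPY,
    "gzip" -> CompressionCodecName.GZIP,
    "lzo" -> CompressionCodecName.LZO)

  // Hypothetical public lookup so tests need not touch the private map.
  def getParquetCompressionCodecName(name: String): String = {
    shortParquetCompressionCodecNames(name).name()
  }
}
```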
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r161698145 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala ---
@@ -61,7 +61,7 @@ class OrcOptions(
 object OrcOptions {
   // The ORC compression short names
-  private val shortOrcCompressionCodecNames = Map(
+  val shortOrcCompressionCodecNames = Map(
--- End diff -- Instead of changing the access modifiers, add a public function:
```Scala
def getORCCompressionCodecName(name: String): String = shortOrcCompressionCodecNames(name)
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
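For context, a sketch of how `OrcOptions` might look with that accessor. The map entries are shown for illustration only (the ORC map holds plain `String` codec names, so no conversion is needed):

```scala
object OrcOptions {
  // The ORC compression short names stay private, as the reviewer suggests...
  private val shortOrcCompressionCodecNames = Map(
    "none" -> "NONE",
    "uncompressed" -> "NONE",
    "snappy" -> "SNAPPY",
    "zlib" -> "ZLIB",
    "lzo" -> "LZO")

  // ...and callers such as the test suite go through the public lookup.
  def getORCCompressionCodecName(name: String): String =
    shortOrcCompressionCodecNames(name)
}
```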
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252680 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above; comment truncated)
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162252520 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala --- (diff context identical to the excerpt quoted above, ending at)
+      case "orc" => new File(path).listFiles().filter{ file =>
--- End diff -- Nit: add a space before `{` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
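For clarity, the nit applied to the quoted line (style only, no behavior change), wrapped in a small standalone helper:

```scala
import java.io.File

// The quoted filter with the requested space before the closure brace; this
// matches the `.map { orcFile => ... }` style used immediately below it in
// the suite.
def dataFiles(path: String): Seq[File] =
  new File(path).listFiles().filter { file =>
    file.isFile && !file.getName.endsWith(".crc") && file.getName != "_SUCCESS"
  }.toSeq
```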