[spark] branch master updated (b83304f -> d599807)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from b83304f  [SPARK-28796][DOC] Document DROP DATABASE statement in SQL Reference
   add d599807  [SPARK-28795][DOC][SQL] Document CREATE VIEW statement in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-ddl-create-view.md | 62 +-
 1 file changed, 61 insertions(+), 1 deletion(-)
[spark] branch master updated (ee63031 -> b83304f)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from ee63031  [SPARK-28828][DOC] Document REFRESH TABLE command
   add b83304f  [SPARK-28796][DOC] Document DROP DATABASE statement in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-ddl-drop-database.md | 60 +++-
 1 file changed, 59 insertions(+), 1 deletion(-)
[spark] branch master updated (5631a96 -> ee63031)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 5631a96  [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection
   add ee63031  [SPARK-28828][DOC] Document REFRESH TABLE command

No new revisions were added by this update.

Summary of changes:
 docs/_data/menu-sql.yaml                 | 2 ++
 docs/sql-ref-syntax-aux-refresh-table.md | 58
 2 files changed, 60 insertions(+)
 create mode 100644 docs/sql-ref-syntax-aux-refresh-table.md
[spark] branch master updated (c56a012 -> 5631a96)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from c56a012  [SPARK-29060][SQL] Add tree traversal helper for adaptive spark plans
   add 5631a96  [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/Column.scala | 10 -
 .../apache/spark/sql/ColumnExpressionSuite.scala | 45 ++
 2 files changed, 37 insertions(+), 18 deletions(-)
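The change added above (SPARK-29048) targets Column.isInCollection, which filters a column against a driver-side collection and became expensive to plan when that collection was large. A minimal usage sketch, assuming only a local SparkSession; the data set and the size of the collection are illustrative:

import org.apache.spark.sql.SparkSession

object IsInCollectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("isInCollection sketch").getOrCreate()
    import spark.implicits._

    val df = spark.range(0, 100000).toDF("id")

    // A large driver-side collection; SPARK-29048 is about keeping the
    // planning cost of this predicate reasonable as the collection grows.
    val wanted: Seq[Long] = 0L until 50000L by 3L
    df.filter($"id".isInCollection(wanted)).count()

    spark.stop()
  }
}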
[spark] branch master updated (c839d09 -> 125af78d)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from c839d09  [SPARK-28773][DOC][SQL] Handling of NULL data in Spark SQL
   add 125af78d [SPARK-28831][DOC][SQL] Document CLEAR CACHE statement in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-aux-cache-clear-cache.md   | 18 +-
 docs/sql-ref-syntax-aux-cache-uncache-table.md | 4 ++--
 docs/sql-ref-syntax-aux-cache.md               | 15 +++
 3 files changed, 26 insertions(+), 11 deletions(-)
[spark] branch master updated (6378d4b -> c839d09)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 6378d4b  [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3
   add c839d09  [SPARK-28773][DOC][SQL] Handling of NULL data in Spark SQL

No new revisions were added by this update.

Summary of changes:
 docs/_data/menu-sql.yaml       | 2 +
 docs/sql-ref-null-semantics.md | 703 +
 2 files changed, 705 insertions(+)
 create mode 100644 docs/sql-ref-null-semantics.md
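The null-semantics page added above covers how comparison and logical operators treat NULL, which is easy to see directly. A short sketch of the documented behaviour, assuming only a local SparkSession:

import org.apache.spark.sql.SparkSession

object NullSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Ordinary comparisons with NULL yield NULL (unknown), not true or false.
    spark.sql("SELECT NULL = NULL AS eq, 5 <> NULL AS neq").show()

    // IS NULL and the null-safe equality operator <=> always yield true/false.
    spark.sql("SELECT NULL IS NULL AS is_null, NULL <=> NULL AS null_safe_eq").show()

    spark.stop()
  }
}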
[spark] branch master updated (d4eca7c -> 4a3a6b6)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from d4eca7c  [SPARK-29000][SQL] Decimal precision overflow when don't allow precision loss
   add 4a3a6b6  [SPARK-28637][SQL] Thriftserver support interval type

No new revisions were added by this update.

Summary of changes:
 .../thriftserver/SparkExecuteStatementOperation.scala   | 9 -
 .../sql/hive/thriftserver/HiveThriftServer2Suites.scala | 15 +++
 .../SparkThriftServerProtocolVersionsSuite.scala        | 4 ++--
 3 files changed, 25 insertions(+), 3 deletions(-)
[spark] branch branch-2.4 updated: Revert "[SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles()"
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 0a4b356 Revert "[SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles()" 0a4b356 is described below commit 0a4b35642ffa3020ec0fcae2cca59376e2095636 Author: Xiao Li AuthorDate: Fri Sep 6 23:37:36 2019 -0700 Revert "[SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles()" This reverts commit 2654c33fd6a7a09e2b2fa9fc1c2ea6224ab292e6. --- .../scala/org/apache/spark/streaming/Checkpoint.scala | 4 ++-- .../org/apache/spark/streaming/CheckpointSuite.scala| 17 - 2 files changed, 2 insertions(+), 19 deletions(-) diff --git a/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala index b081287..a882558 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala @@ -128,8 +128,8 @@ object Checkpoint extends Logging { try { val statuses = fs.listStatus(path) if (statuses != null) { -val paths = statuses.filterNot(_.isDirectory).map(_.getPath) -val filtered = paths.filter(p => REGEX.findFirstIn(p.getName).nonEmpty) +val paths = statuses.map(_.getPath) +val filtered = paths.filter(p => REGEX.findFirstIn(p.toString).nonEmpty) filtered.sortWith(sortFunc) } else { logWarning(s"Listing $path returned null") diff --git a/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala index 43e3cdd..19b621f 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala @@ -846,23 +846,6 @@ class CheckpointSuite extends TestSuiteBase with DStreamCheckpointTester checkpointWriter.stop() } - test("SPARK-28912: Fix MatchError in getCheckpointFiles") { -withTempDir { tempDir => - val fs = FileSystem.get(tempDir.toURI, new Configuration) - val checkpointDir = tempDir.getAbsolutePath + "/checkpoint-01" - - assert(Checkpoint.getCheckpointFiles(checkpointDir, Some(fs)).length === 0) - - // Ignore files whose parent path match. - fs.create(new Path(checkpointDir, "this-is-matched-before-due-to-parent-path")).close() - assert(Checkpoint.getCheckpointFiles(checkpointDir, Some(fs)).length === 0) - - // Ignore directories whose names match. - fs.mkdirs(new Path(checkpointDir, "checkpoint-10")) - assert(Checkpoint.getCheckpointFiles(checkpointDir, Some(fs)).length === 0) -} - } - test("SPARK-6847: stack overflow when updateStateByKey is followed by a checkpointed dstream") { // In this test, there are two updateStateByKey operators. The RDD DAG is as follows: // - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (ff5fa58 -> 89aba69)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from ff5fa58  [SPARK-21870][SQL][FOLLOW-UP] Clean up string template formats for generated code in HashAggregateExec
   add 89aba69  [SPARK-28935][SQL][DOCS] Document SQL metrics for Details for Query Plan

No new revisions were added by this update.

Summary of changes:
 docs/web-ui.md | 35 +++
 1 file changed, 35 insertions(+)
[spark] branch master updated (67b4329 -> b2f0660)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 67b4329  [SPARK-28690][SQL] Add `date_part` function for timestamps/dates
   add b2f0660  [SPARK-29002][SQL] Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/plans/logical/hints.scala   | 8 +++
 .../org/apache/spark/sql/internal/SQLConf.scala    | 12 +
 .../spark/sql/execution/SparkStrategies.scala      | 4 +-
 .../execution/adaptive/AdaptiveSparkPlanExec.scala | 4 +-
 .../adaptive/DemoteBroadcastHashJoin.scala         | 60 ++
 .../adaptive/AdaptiveQueryExecSuite.scala          | 29 ++-
 6 files changed, 114 insertions(+), 3 deletions(-)
 create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DemoteBroadcastHashJoin.scala
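The DemoteBroadcastHashJoin rule added above keeps adaptive execution from rewriting a sort-merge join into a broadcast hash join when the candidate build side is dominated by empty partitions and its size estimate is therefore misleading. A hedged sketch of the setup in which the rule applies; the data is artificial, and only the standard AQE switch is shown (the empty-partition ratio threshold introduced by this commit has its own SQLConf key, not reproduced here):

import org.apache.spark.sql.SparkSession

object AqeDemoteBhjSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // With AQE on, runtime statistics can turn a sort-merge join into a
    // broadcast hash join mid-query.
    spark.conf.set("spark.sql.adaptive.enabled", "true")

    val big = spark.range(0, 1000000).toDF("id")
    // A side that is small overall but consists mostly of empty partitions.
    val sparse = Seq(1L, 2L, 3L).toDF("id").repartition(200)

    // SPARK-29002: when the build side has a high ratio of empty partitions,
    // AQE now keeps the sort-merge join instead of switching to broadcast.
    big.join(sparse, "id").count()

    spark.stop()
  }
}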
[spark] branch master updated (5adaa2e -> e4f7002)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 5adaa2e  [SPARK-28979][SQL] Rename UnresovledTable to V1Table
   add e4f7002  [SPARK-28830][DOC][SQL] Document UNCACHE TABLE statement in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-aux-cache-uncache-table.md | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)
[spark] branch master updated (f96486b -> a7a3935)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from f96486b  [SPARK-28808][DOCS][SQL] Document SHOW FUNCTIONS in SQL Reference
   add a7a3935  [SPARK-11150][SQL] Dynamic Partition Pruning

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/unsafe/map/BytesToBytesMap.java   | 7 +
 .../sql/catalyst/expressions/DynamicPruning.scala  | 95 ++
 .../sql/catalyst/expressions/predicates.scala      | 39 +-
 .../spark/sql/catalyst/optimizer/Optimizer.scala   | 4 +-
 .../org/apache/spark/sql/internal/SQLConf.scala    | 43 +
 .../CleanupDynamicPruningFilters.scala             | 51 +
 .../sql/dynamicpruning/PartitionPruning.scala      | 259
 .../dynamicpruning/PlanDynamicPruningFilters.scala | 91 ++
 .../spark/sql/execution/DataSourceScanExec.scala   | 69 +-
 .../spark/sql/execution/QueryExecution.scala       | 3 +-
 .../spark/sql/execution/SparkOptimizer.scala       | 11 +-
 .../sql/execution/SubqueryBroadcastExec.scala      | 108 ++
 .../adaptive/InsertAdaptiveSparkPlan.scala         | 6 +-
 .../execution/datasources/DataSourceStrategy.scala | 2 +-
 .../datasources/PruneFileSourcePartitions.scala    | 6 +-
 .../spark/sql/execution/joins/HashJoin.scala       | 22 +-
 .../spark/sql/execution/joins/HashedRelation.scala | 84 +-
 .../org/apache/spark/sql/execution/subquery.scala  | 81 +-
 .../spark/sql/DynamicPartitionPruningSuite.scala   | 1293
 .../sql/execution/joins/HashedRelationSuite.scala  | 199 ++-
 20 files changed, 2447 insertions(+), 26 deletions(-)
 create mode 100644 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DynamicPruning.scala
 create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/dynamicpruning/CleanupDynamicPruningFilters.scala
 create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/dynamicpruning/PartitionPruning.scala
 create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/dynamicpruning/PlanDynamicPruningFilters.scala
 create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala
 create mode 100644 sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
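Dynamic partition pruning, merged above, turns a selective filter on the dimension side of a join into a runtime subquery that skips partitions of the partitioned fact table. A hedged sketch of the query shape the feature targets; the table names are invented, and the config key shown is the one this feature uses in Spark 3.0, so treat it as an assumption here:

import org.apache.spark.sql.SparkSession

object DynamicPartitionPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Assumed config key; the feature is on by default in Spark 3.0.
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

    spark.sql("CREATE TABLE dim_date (id BIGINT, year INT) USING parquet")
    spark.sql(
      "CREATE TABLE fact_sales (amount DOUBLE, date_id BIGINT) USING parquet PARTITIONED BY (date_id)")

    // The filter on dim_date is evaluated at runtime and reused to prune
    // fact_sales partitions that cannot satisfy the join.
    spark.sql(
      """SELECT f.date_id, SUM(f.amount)
        |FROM fact_sales f JOIN dim_date d ON f.date_id = d.id
        |WHERE d.year = 2019
        |GROUP BY f.date_id""".stripMargin).explain()

    spark.stop()
  }
}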
[spark] branch master updated: [SPARK-28808][DOCS][SQL] Document SHOW FUNCTIONS in SQL Reference
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f96486b [SPARK-28808][DOCS][SQL] Document SHOW FUNCTIONS in SQL Reference f96486b is described below commit f96486b4aaba36a0f843c8a52801c305b0fa2b16 Author: Dilip Biswal AuthorDate: Wed Sep 4 11:47:10 2019 -0700 [SPARK-28808][DOCS][SQL] Document SHOW FUNCTIONS in SQL Reference ### What changes were proposed in this pull request? Document SHOW FUNCTIONS statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After.** ![image](https://user-images.githubusercontent.com/11567269/64281840-e3cc0f00-cf08-11e9-9784-f01392276130.png) https://user-images.githubusercontent.com/11567269/64281911-0fe79000-cf09-11e9-955f-21b44590707c.png";> https://user-images.githubusercontent.com/11567269/64281916-12e28080-cf09-11e9-9187-688c2c751559.png";> ### How was this patch tested? Tested using jykyll build --serve Closes #25539 from dilipbiswal/ref-doc-show-functions. Lead-authored-by: Dilip Biswal Co-authored-by: Xiao Li Signed-off-by: Xiao Li --- docs/sql-ref-syntax-aux-describe-function.md | 2 +- docs/sql-ref-syntax-aux-show-functions.md| 119 ++- 2 files changed, 119 insertions(+), 2 deletions(-) diff --git a/docs/sql-ref-syntax-aux-describe-function.md b/docs/sql-ref-syntax-aux-describe-function.md index 1c50708..d3dc192 100644 --- a/docs/sql-ref-syntax-aux-describe-function.md +++ b/docs/sql-ref-syntax-aux-describe-function.md @@ -34,7 +34,7 @@ metadata information is returned along with the extended usage information. function_name -Specifies a name of an existing function in the syetem. The function name may be +Specifies a name of an existing function in the system. The function name may be optionally qualified with a database name. If `function_name` is qualified with a database then the function is resolved from the user specified database, otherwise it is resolved from the current database. diff --git a/docs/sql-ref-syntax-aux-show-functions.md b/docs/sql-ref-syntax-aux-show-functions.md index ae689fd..db02607 100644 --- a/docs/sql-ref-syntax-aux-show-functions.md +++ b/docs/sql-ref-syntax-aux-show-functions.md @@ -19,4 +19,121 @@ license: | limitations under the License. --- -**This page is under construction** +### Description +Returns the list of functions after applying an optional regex pattern. +Given number of functions supported by Spark is quite large, this statement +in conjuction with [describe function](sql-ref-syntax-aux-describe-function.html) +may be used to quickly find the function and understand its usage. The `LIKE` +clause is optional and supported only for compatibility with other systems. + +### Syntax +{% highlight sql %} +SHOW [ function_kind ] FUNCTIONS ([LIKE] function_name | regex_pattern) +{% endhighlight %} + +### Parameters + + function_kind + +Specifies the name space of the function to be searched upon. The valid name spaces are : + + USER - Looks up the function(s) among the user defined functions. + SYSTEM - Looks up the function(s) among the system defined functions. 
+ ALL - Looks up the function(s) among both user and system defined functions. + + + function_name + +Specifies a name of an existing function in the system. The function name may be +optionally qualified with a database name. If `function_name` is qualified with +a database then the function is resolved from the user specified database, otherwise +it is resolved from the current database. +Syntax: + +[database_name.]function_name + + + regex_pattern + +Specifies a regular expression pattern that is used to limit the results of the +statement. + + Only `*` and `|` are allowed as wildcard pattern. + Excluding `*` and `|` the remaining pattern follows the regex semantics. + The leading and trailing blanks are trimmed in the input pattern before processing. + + + + +### Examples +{% highlight sql %} +-- List a system function `trim` by searching both user defined and system +-- defined functions. +SHOW FUNCTIONS trim; + ++ + |function| + ++ + |trim| + ++ + +-- List a system function `concat` by searching system defined functions. +SHOW SYSTEM
[spark] branch master updated (594c9c5 -> b992160)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 594c9c5  [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
   add b992160  [SPARK-28811][DOCS][SQL] Document SHOW TBLPROPERTIES in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-aux-show-tblproperties.md | 94 ++-
 1 file changed, 93 insertions(+), 1 deletion(-)
[spark] branch master updated (a07f795 -> 9f478a6)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from a07f795  [SPARK-28577][YARN] Resource capability requested for each executor add offHeapMemorySize
   add 9f478a6  [SPARK-28901][SQL] SparkThriftServer's Cancel SQL Operation show it in JDBC Tab UI

No new revisions were added by this update.

Summary of changes:
 .../sql/hive/thriftserver/HiveThriftServer2.scala | 48 +++-
 .../SparkExecuteStatementOperation.scala          | 89 ++
 2 files changed, 87 insertions(+), 50 deletions(-)
[spark] branch master updated (8980093 -> 56f2887)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 8980093  [SPARK-3137][CORE] Replace the global TorrentBroadcast lock with fine grained KeyLock
   add 56f2887  [SPARK-28788][DOC][SQL] Document ANALYZE TABLE statement in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-aux-analyze-table.md | 76 +++-
 docs/sql-ref-syntax-aux-analyze.md       | 13 +++---
 2 files changed, 80 insertions(+), 9 deletions(-)
[spark] branch master updated: [SPARK-28961][HOT-FIX][BUILD] Upgrade Maven from 3.6.1 to 3.6.2
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2856398 [SPARK-28961][HOT-FIX][BUILD] Upgrade Maven from 3.6.1 to 3.6.2 2856398 is described below commit 2856398de9d35d136758bbc11afa4d1dc0c98830 Author: Xiao Li AuthorDate: Tue Sep 3 11:06:57 2019 -0700 [SPARK-28961][HOT-FIX][BUILD] Upgrade Maven from 3.6.1 to 3.6.2 ### What changes were proposed in this pull request? This PR is to upgrade the maven dependence from 3.6.1 to 3.6.2. ### Why are the changes needed? All the builds are broken because 3.6.1 is not available. http://ftp.wayne.edu/apache//maven/maven-3/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/485/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10536/ ![image](https://user-images.githubusercontent.com/11567269/64196667-36d69100-ce39-11e9-8f93-40eb333d595d.png) ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25665 from gatorsmile/upgradeMVN. Authored-by: Xiao Li Signed-off-by: Xiao Li --- dev/appveyor-install-dependencies.ps1 | 2 +- docs/building-spark.md| 2 +- pom.xml | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/dev/appveyor-install-dependencies.ps1 b/dev/appveyor-install-dependencies.ps1 index 85e0df6..d33a107 100644 --- a/dev/appveyor-install-dependencies.ps1 +++ b/dev/appveyor-install-dependencies.ps1 @@ -81,7 +81,7 @@ if (!(Test-Path $tools)) { # == Maven Push-Location $tools -$mavenVer = "3.6.1" +$mavenVer = "3.6.2" Start-FileDownload "https://archive.apache.org/dist/maven/maven-3/$mavenVer/binaries/apache-maven-$mavenVer-bin.zip"; "maven.zip" # extract diff --git a/docs/building-spark.md b/docs/building-spark.md index 1f8e51f..37f8986 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -27,7 +27,7 @@ license: | ## Apache Maven The Maven-based build is the build of reference for Apache Spark. -Building Spark using Maven requires Maven 3.6.1 and Java 8. +Building Spark using Maven requires Maven 3.6.2 and Java 8. Spark requires Scala 2.12; support for Scala 2.11 was removed in Spark 3.0.0. ### Setting up Maven's Memory Usage diff --git a/pom.xml b/pom.xml index 6544b50..e2a3559 100644 --- a/pom.xml +++ b/pom.xml @@ -118,7 +118,7 @@ 1.8 ${java.version} ${java.version} -3.6.1 +3.6.2 spark 1.7.16 1.2.17 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (92ae271 -> 94e6674)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 92ae271  [SPARK-28806][DOCS][SQL] Document SHOW COLUMNS in SQL Reference
   add 94e6674  [SPARK-28805][DOCS][SQL] Document DESCRIBE FUNCTION in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-aux-describe-function.md | 92 +++-
 1 file changed, 91 insertions(+), 1 deletion(-)
[spark] branch master updated (5cf2602 -> 92ae271)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 5cf2602  [SPARK-28946][R][DOCS] Add some more information about building SparkR on Windows
   add 92ae271  [SPARK-28806][DOCS][SQL] Document SHOW COLUMNS in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-aux-show-columns.md | 77 -
 1 file changed, 76 insertions(+), 1 deletion(-)
[spark] branch master updated (eb037a8 -> 585954d)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from eb037a8  [SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations
   add 585954d  [SPARK-28790][DOC][SQL] Document CACHE TABLE statement in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-aux-cache-cache-table.md | 63 +++-
 1 file changed, 62 insertions(+), 1 deletion(-)
[spark] branch master updated: [SPARK-28786][DOC][SQL][FOLLOW-UP] Change "Related Statements" to bold
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b85a554 [SPARK-28786][DOC][SQL][FOLLOW-UP] Change "Related Statements" to bold b85a554 is described below commit b85a554487540bbbae1c62ee6ddd315d6760ea6a Author: Huaxin Gao AuthorDate: Sat Aug 31 14:58:41 2019 -0700 [SPARK-28786][DOC][SQL][FOLLOW-UP] Change "Related Statements" to bold ### What changes were proposed in this pull request? Change "Related Statements" to bold ### Why are the changes needed? To make doc look nice and consistent. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Tested using jykyll build --serve Before the change: ![image](https://user-images.githubusercontent.com/13592258/63965303-ae797a00-ca4d-11e9-8a85-71fbfdeaaccb.png) After the change: ![image](https://user-images.githubusercontent.com/13592258/63965316-b76a4b80-ca4d-11e9-9a85-48d7a909f0ef.png) Before the change: ![image](https://user-images.githubusercontent.com/13592258/63988989-7c8b0680-ca93-11e9-9352-a9ec5457b279.png) After the change: ![image](https://user-images.githubusercontent.com/13592258/63988996-87459b80-ca93-11e9-9e51-8cb36a632436.png) Closes #25623 from huaxingao/spark-28786-n. Authored-by: Huaxin Gao Signed-off-by: Xiao Li --- docs/sql-ref-syntax-dml-insert-into.md | 11 --- ...ql-ref-syntax-dml-insert-overwrite-directory-hive.md | 2 +- docs/sql-ref-syntax-dml-insert-overwrite-directory.md | 2 +- docs/sql-ref-syntax-dml-insert-overwrite-table.md | 17 ++--- 4 files changed, 12 insertions(+), 20 deletions(-) diff --git a/docs/sql-ref-syntax-dml-insert-into.md b/docs/sql-ref-syntax-dml-insert-into.md index 1bb043d..890e30b 100644 --- a/docs/sql-ref-syntax-dml-insert-into.md +++ b/docs/sql-ref-syntax-dml-insert-into.md @@ -95,9 +95,8 @@ INSERT INTO [ TABLE ] table_name {% endhighlight %} Insert Using a SELECT Statement -Assuming the `persons` table has already been created and populated. - {% highlight sql %} + -- Assuming the persons table has already been created and populated. SELECT * FROM persons; + -- + -- + -- + @@ -127,9 +126,8 @@ Assuming the `persons` table has already been created and populated. {% endhighlight %} Insert Using a TABLE Statement -Assuming the `visiting_students` table has already been created and populated. - {% highlight sql %} + -- Assuming the visiting_students table has already been created and populated. SELECT * FROM visiting_students; + -- + -- + -- + @@ -162,9 +160,8 @@ Assuming the `visiting_students` table has already been created and populated. {% endhighlight %} Insert Using a FROM Statement -Assuming the `applicants` table has already been created and populated. - {% highlight sql %} + -- Assuming the applicants table has already been created and populated. SELECT * FROM applicants; + -- + -- + -- + -- + @@ -203,7 +200,7 @@ Assuming the `applicants` table has already been created and populated. 
+ -- + -- + -- + {% endhighlight %} -Related Statements: +### Related Statements * [INSERT OVERWRITE statement](sql-ref-syntax-dml-insert-overwrite-table.html) * [INSERT OVERWRITE DIRECTORY statement](sql-ref-syntax-dml-insert-overwrite-directory.html) * [INSERT OVERWRITE DIRECTORY with Hive format statement](sql-ref-syntax-dml-insert-overwrite-directory-hive.html) \ No newline at end of file diff --git a/docs/sql-ref-syntax-dml-insert-overwrite-directory-hive.md b/docs/sql-ref-syntax-dml-insert-overwrite-directory-hive.md index f09da80..784e631 100644 --- a/docs/sql-ref-syntax-dml-insert-overwrite-directory-hive.md +++ b/docs/sql-ref-syntax-dml-insert-overwrite-directory-hive.md @@ -81,7 +81,7 @@ INSERT OVERWRITE [ LOCAL ] DIRECTORY directory_path SELECT * FROM test_table; {% endhighlight %} -Related Statements: +### Related Statements * [INSERT INTO statement](sql-ref-syntax-dml-insert-into.html) * [INSERT OVERWRITE statement](sql-ref-syntax-dml-insert-overwrite-table.html) * [INSERT OVERWRITE DIRECTORY statement](sql-ref-syntax-dml-insert-overwrite-directory.html) \ No newline at end of file diff --git a/docs/sql-ref-syntax-dml-insert-overwrite-directory.md b/docs/sql-ref-syntax-dml-insert-overwrite-directory.md index 8262d9e..89d58e7 100644 --- a/docs/sql-ref-syntax-dml-insert-overwrite-directory.md +++ b/docs/sql-ref-syntax-dml-insert-overwrite-directo
[spark] branch master updated: [SPARK-28803][DOCS][SQL] Document DESCRIBE TABLE in SQL Reference
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b4d7b30 [SPARK-28803][DOCS][SQL] Document DESCRIBE TABLE in SQL Reference b4d7b30 is described below commit b4d7b30aa62792e7dfaa1318d0ef608286fde806 Author: Dilip Biswal AuthorDate: Sat Aug 31 14:46:55 2019 -0700 [SPARK-28803][DOCS][SQL] Document DESCRIBE TABLE in SQL Reference ### What changes were proposed in this pull request? Document DESCRIBE TABLE statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After.** https://user-images.githubusercontent.com/14225158/64069071-f556a380-cbf6-11e9-985d-13dd37a32bbb.png";> https://user-images.githubusercontent.com/14225158/64069073-f982c100-cbf6-11e9-925b-eb2fc85c3341.png";> https://user-images.githubusercontent.com/14225158/64069076-0ef7eb00-cbf7-11e9-8062-9a9fb8700bb3.png";> https://user-images.githubusercontent.com/14225158/64069077-0f908180-cbf7-11e9-9a31-9b7f122db2d3.png";> https://user-images.githubusercontent.com/14225158/64069078-0f908180-cbf7-11e9-96ee-438a7b64c961.png";> https://user-images.githubusercontent.com/14225158/64069079-0f908180-cbf7-11e9-9bae-734a1994f936.png";> ### How was this patch tested? Tested using jykyll build --serve Closes #25527 from dilipbiswal/ref-doc-desc-table. Lead-authored-by: Dilip Biswal Co-authored-by: Xiao Li Signed-off-by: Xiao Li --- docs/sql-ref-syntax-aux-describe-table.md | 163 +- 1 file changed, 162 insertions(+), 1 deletion(-) diff --git a/docs/sql-ref-syntax-aux-describe-table.md b/docs/sql-ref-syntax-aux-describe-table.md index 110a5e4..e2cb0e4 100644 --- a/docs/sql-ref-syntax-aux-describe-table.md +++ b/docs/sql-ref-syntax-aux-describe-table.md @@ -18,5 +18,166 @@ license: | See the License for the specific language governing permissions and limitations under the License. --- +### Description +`DESCRIBE TABLE` statement returns the basic metadata information of a +table. The metadata information includes column name, column type +and column comment. Optionally a partition spec or column name may be specified +to return the metadata pertaining to a partition or column respectively. -**This page is under construction** +### Syntax +{% highlight sql %} +{DESC | DESCRIBE} [TABLE] [format] table_identifier [partition_spec] [col_name] +{% endhighlight %} + +### Parameters + + format + +Specifies the optional format of describe output. If `EXTENDED` is specified +then additional metadata information (such as parent database, owner, and access time) +is returned. + + table_identifier + +Specifies a table name, which may be optionally qualified with a database name. +Syntax: + +[database_name.]table_name + + + partition_spec + +An optional parameter that specifies a comma separated list of key and value pairs +for paritions. When specified, additional partition metadata is returned. +Syntax: + +PARTITION (partition_col_name = partition_col_val [ , ... ]) + + + col_name + +An optional paramter that specifies the column name that needs to be described. +The supplied column name may be optionally qualified. 
Parameters `partition_spec` +and `col_name` are mutually exclusive and can not be specified together. Currently +nested columns are not allowed to be specified. + +Syntax: + +[database_name.][table_name.]column_name + + + + +### Examples +{% highlight sql %} +-- Creates a table `customer`. Assumes current database is `salesdb`. +CREATE TABLE customer( +cust_id INT, +state VARCHAR(20), +name STRING COMMENT 'Short name' + ) + USING parquet + PARTITION BY state; + ; + +-- Returns basic metadata information for unqualified table `customer` +DESCRIBE TABLE customer; + +---+-+--+ + |col_name |data_type|comment | + +---+-+--+ + |cust_id|int |null | + |name |string |Short name| + |state |string |null | + |# Partition Information| | | + |# col_name |data_type|comment | + |state |string |null | + +---+-+--+ +
[spark] branch master updated (7cc0f0e -> a08f33b)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 7cc0f0e  [SPARK-28894][SQL][TESTS] Add a clue to make it easier to debug via Jenkins's test results
   add a08f33b  [SPARK-28804][DOCS][SQL] Document DESCRIBE QUERY in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-aux-describe-query.md | 87 ++-
 1 file changed, 86 insertions(+), 1 deletion(-)
[spark] branch master updated: [SPARK-28807][DOCS][SQL] Document SHOW DATABASES in SQL Reference
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fb1053d [SPARK-28807][DOCS][SQL] Document SHOW DATABASES in SQL Reference fb1053d is described below commit fb1053d14a3691695539304457b50e7342020c1f Author: Dilip Biswal AuthorDate: Thu Aug 29 09:04:27 2019 -0700 [SPARK-28807][DOCS][SQL] Document SHOW DATABASES in SQL Reference ### What changes were proposed in this pull request? Document SHOW DATABASES statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After.** https://user-images.githubusercontent.com/14225158/63916727-dd600380-c9ed-11e9-8372-789110c9d2dc.png";> https://user-images.githubusercontent.com/14225158/63916734-e0f38a80-c9ed-11e9-8ad4-d854febeaab8.png";> https://user-images.githubusercontent.com/14225158/63916740-e4871180-c9ed-11e9-9cfc-199cd8a64852.png";> ### How was this patch tested? Tested using jykyll build --serve Closes #25526 from dilipbiswal/ref-doc-show-db. Authored-by: Dilip Biswal Signed-off-by: Xiao Li --- docs/sql-ref-syntax-aux-show-databases.md | 63 +-- 1 file changed, 60 insertions(+), 3 deletions(-) diff --git a/docs/sql-ref-syntax-aux-show-databases.md b/docs/sql-ref-syntax-aux-show-databases.md index e7aedf8..b1c48da 100644 --- a/docs/sql-ref-syntax-aux-show-databases.md +++ b/docs/sql-ref-syntax-aux-show-databases.md @@ -1,7 +1,7 @@ --- layout: global -title: SHOW DATABASE -displayTitle: SHOW DATABASE +title: SHOW DATABASES +displayTitle: SHOW DATABASES license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -19,4 +19,61 @@ license: | limitations under the License. --- -**This page is under construction** +### Description +Lists the databases that match an optionally supplied string pattern. If no +pattern is supplied then the command lists all the databases in the system. +Please note that the usage of `SCHEMAS` and `DATABASES` are interchangable +and mean the same thing. + +### Syntax +{% highlight sql %} +SHOW {DATABASES|SCHEMAS} [LIKE string_pattern] +{% endhighlight %} + +### Parameters + + LIKE string_pattern + +Specifies a string pattern that is used to match the databases in the system. In +the specified string pattern '*' matches any number of characters. + + + +### Examples +{% highlight sql %} +-- Create database. Assumes a database named `default` already exists in +-- the system. +CREATE DATABASE payroll_db; +CREATE DATABASE payments_db; + +-- Lists all the databases. +SHOW DATABASES; + ++ + |databaseName| + ++ + | default| + | payments_db| + | payroll_db| + ++ +-- Lists databases with name starting with string pattern `pay` +SHOW DATABASES LIKE 'pay*'; + ++ + |databaseName| + ++ + | payments_db| + | payroll_db| + ++ +-- Lists all databases. Keywords SCHEMAS and DATABASES are interchangeable. 
+SHOW SCHEMAS; + ++ + |databaseName| + ++ + | default| + | payments_db| + | payroll_db| + ++ +{% endhighlight %} +### Related Statements +- [DESCRIBE DATABASE](sql-ref-syntax-aux-describe-databases.html) +- [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html) +- [ALTER DATABASE](sql-ref-syntax-ddl-alter-database.html) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (2465558 -> 3e09a0f)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 2465558  [SPARK-28495][SQL][FOLLOW-UP] Disallow conversions between timestamp and long in ASNI mode
   add 3e09a0f  [SPARK-28786][DOC][SQL] Document INSERT statement in SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/sql-ref-syntax-dml-insert-into.md             | 209 +
 ...f-syntax-dml-insert-overwrite-directory-hive.md | 87 +
 ...ql-ref-syntax-dml-insert-overwrite-directory.md | 85 +
 docs/sql-ref-syntax-dml-insert-overwrite-table.md  | 191 +++
 docs/sql-ref-syntax-dml-insert.md                  | 12 +-
 5 files changed, 580 insertions(+), 4 deletions(-)
 create mode 100644 docs/sql-ref-syntax-dml-insert-into.md
 create mode 100644 docs/sql-ref-syntax-dml-insert-overwrite-directory-hive.md
 create mode 100644 docs/sql-ref-syntax-dml-insert-overwrite-directory.md
 create mode 100644 docs/sql-ref-syntax-dml-insert-overwrite-table.md
[spark] branch master updated (1b404b9 -> 7452786)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 1b404b9  [SPARK-28890][SQL] Upgrade Hive Metastore Client to the 3.1.2 for Hive 3.1
   add 7452786  [SPARK-28789][DOCS][SQL] Document ALTER DATABASE command

No new revisions were added by this update.

Summary of changes:
 docs/css/main.css                         | 14 +++
 docs/sql-ref-syntax-ddl-alter-database.md | 41 ++-
 2 files changed, 54 insertions(+), 1 deletion(-)
[spark] branch master updated: [SPARK-28642][SQL][TEST][FOLLOW-UP] Test spark.sql.redaction.options.regex with and without default values
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c353a84 [SPARK-28642][SQL][TEST][FOLLOW-UP] Test spark.sql.redaction.options.regex with and without default values c353a84 is described below commit c353a84d1a991797f255ec312e5935438727536c Author: Yuming Wang AuthorDate: Sun Aug 25 23:12:16 2019 -0700 [SPARK-28642][SQL][TEST][FOLLOW-UP] Test spark.sql.redaction.options.regex with and without default values ### What changes were proposed in this pull request? Test `spark.sql.redaction.options.regex` with and without default values. ### Why are the changes needed? Normally, we do not rely on the default value of `spark.sql.redaction.options.regex`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25579 from wangyum/SPARK-28642-f1. Authored-by: Yuming Wang Signed-off-by: Xiao Li --- .../scala/org/apache/spark/sql/jdbc/JDBCSuite.scala | 20 +--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala index 72a5645..7fe00ae 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala @@ -1031,8 +1031,10 @@ class JDBCSuite extends QueryTest } test("Hide credentials in show create table") { +val userName = "testUser" val password = "testPass" val tableName = "tab1" +val dbTable = "TEST.PEOPLE" withTable(tableName) { sql( s""" @@ -1040,18 +1042,30 @@ class JDBCSuite extends QueryTest |USING org.apache.spark.sql.jdbc |OPTIONS ( | url '$urlWithUserAndPass', - | dbtable 'TEST.PEOPLE', - | user 'testUser', + | dbtable '$dbTable', + | user '$userName', | password '$password') """.stripMargin) val show = ShowCreateTableCommand(TableIdentifier(tableName)) spark.sessionState.executePlan(show).executedPlan.executeCollect().foreach { r => assert(!r.toString.contains(password)) +assert(r.toString.contains(dbTable)) +assert(r.toString.contains(userName)) } sql(s"SHOW CREATE TABLE $tableName").collect().foreach { r => -assert(!r.toString().contains(password)) +assert(!r.toString.contains(password)) +assert(r.toString.contains(dbTable)) +assert(r.toString.contains(userName)) + } + + withSQLConf(SQLConf.SQL_OPTIONS_REDACTION_PATTERN.key -> "(?i)dbtable|user") { + spark.sessionState.executePlan(show).executedPlan.executeCollect().foreach { r => + assert(!r.toString.contains(password)) + assert(!r.toString.contains(dbTable)) + assert(!r.toString.contains(userName)) +} } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
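The follow-up test above exercises spark.sql.redaction.options.regex, which decides which data source options are masked in output such as SHOW CREATE TABLE or EXPLAIN. A hedged sketch of the behaviour the test checks; it mirrors the suite's setup (a pre-populated H2 database on the classpath) and uses placeholder credentials:

import org.apache.spark.sql.SparkSession

object RedactionRegexSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Widen the default pattern so dbtable and user are redacted as well.
    spark.conf.set("spark.sql.redaction.options.regex", "(?i)url|password|dbtable|user")

    spark.sql(
      """CREATE TABLE tab1
        |USING org.apache.spark.sql.jdbc
        |OPTIONS (
        |  url 'jdbc:h2:mem:testdb0;user=testUser;password=testPass',
        |  dbtable 'TEST.PEOPLE',
        |  user 'testUser',
        |  password 'testPass')""".stripMargin)

    // Options matching the pattern are replaced with a redaction marker
    // instead of their literal values.
    spark.sql("SHOW CREATE TABLE tab1").show(truncate = false)

    spark.stop()
  }
}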
[spark] branch master updated: [SPARK-28852][SQL] Implement SparkGetCatalogsOperation for Thrift Server
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new adb506a [SPARK-28852][SQL] Implement SparkGetCatalogsOperation for Thrift Server adb506a is described below commit adb506afd783d24d397c73c34c7fed89563c0a6b Author: Yuming Wang AuthorDate: Sun Aug 25 22:42:50 2019 -0700 [SPARK-28852][SQL] Implement SparkGetCatalogsOperation for Thrift Server ### What changes were proposed in this pull request? This PR implements `SparkGetCatalogsOperation` for Thrift Server metadata completeness. ### Why are the changes needed? Thrift Server metadata completeness. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test Closes #2 from wangyum/SPARK-28852. Authored-by: Yuming Wang Signed-off-by: Xiao Li --- .../thriftserver/SparkGetCatalogsOperation.scala | 79 ++ .../server/SparkSQLOperationManager.scala | 11 +++ .../thriftserver/SparkMetadataOperationSuite.scala | 8 +++ .../cli/operation/GetCatalogsOperation.java| 2 +- .../cli/operation/GetCatalogsOperation.java| 2 +- 5 files changed, 100 insertions(+), 2 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetCatalogsOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetCatalogsOperation.scala new file mode 100644 index 000..cde99fd --- /dev/null +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetCatalogsOperation.scala @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hive.thriftserver + +import java.util.UUID + +import org.apache.hadoop.hive.ql.security.authorization.plugin.HiveOperationType +import org.apache.hive.service.cli.{HiveSQLException, OperationState} +import org.apache.hive.service.cli.operation.GetCatalogsOperation +import org.apache.hive.service.cli.session.HiveSession + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.SQLContext +import org.apache.spark.util.{Utils => SparkUtils} + +/** + * Spark's own GetCatalogsOperation + * + * @param sqlContext SQLContext to use + * @param parentSession a HiveSession from SessionManager + */ +private[hive] class SparkGetCatalogsOperation( +sqlContext: SQLContext, +parentSession: HiveSession) + extends GetCatalogsOperation(parentSession) with Logging { + + private var statementId: String = _ + + override def close(): Unit = { +super.close() +HiveThriftServer2.listener.onOperationClosed(statementId) + } + + override def runInternal(): Unit = { +statementId = UUID.randomUUID().toString +val logMsg = "Listing catalogs" +logInfo(s"$logMsg with $statementId") +setState(OperationState.RUNNING) +// Always use the latest class loader provided by executionHive's state. +val executionHiveClassLoader = sqlContext.sharedState.jarClassLoader +Thread.currentThread().setContextClassLoader(executionHiveClassLoader) + +HiveThriftServer2.listener.onStatementStart( + statementId, + parentSession.getSessionHandle.getSessionId.toString, + logMsg, + statementId, + parentSession.getUsername) + +try { + if (isAuthV2Enabled) { +authorizeMetaGets(HiveOperationType.GET_CATALOGS, null) + } + setState(OperationState.FINISHED) +} catch { + case e: HiveSQLException => +setState(OperationState.ERROR) +HiveThriftServer2.listener.onStatementError( + statementId, e.getMessage, SparkUtils.exceptionString(e)) +throw e +} +HiveThriftServer2.listener.onStatementFinish(statementId) + } +} diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/Spar
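The server-side operation above is what answers the standard JDBC DatabaseMetaData.getCatalogs call once the Thrift Server is running. A hedged client-side sketch; the connection URL and credentials are placeholders, and the Hive JDBC driver is assumed to be on the classpath:

import java.sql.DriverManager

object GetCatalogsClientSketch {
  def main(args: Array[String]): Unit = {
    // A Spark Thrift Server listening on the default port 10000 is assumed.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "anonymous", "")
    try {
      // This metadata call is served by SparkGetCatalogsOperation.
      val rs = conn.getMetaData.getCatalogs
      while (rs.next()) {
        println(rs.getString("TABLE_CAT"))
      }
      rs.close()
    } finally {
      conn.close()
    }
  }
}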
[spark] branch master updated (ed3ea67 -> aefb2e7)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from ed3ea67  [SPARK-28837][SQL] CTAS/RTAS should use nullable schema
   add aefb2e7  [SPARK-28739][SQL] Add a simple cost check for Adaptive Query Execution

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/trees/TreeNode.scala | 4 +
 .../spark/sql/execution/SparkStrategies.scala      | 2 -
 .../execution/adaptive/AdaptiveSparkPlanExec.scala | 180 +++--
 .../spark/sql/execution/adaptive/costing.scala}    | 16 +-
 ...eAdaptiveSubquery.scala => simpleCosting.scala} | 43 ++---
 .../adaptive/AdaptiveQueryExecSuite.scala          | 31 +++-
 6 files changed, 189 insertions(+), 87 deletions(-)
 copy sql/core/src/main/{java/org/apache/spark/sql/api/java/UDF0.java => scala/org/apache/spark/sql/execution/adaptive/costing.scala} (74%)
 copy sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/{ReuseAdaptiveSubquery.scala => simpleCosting.scala} (50%)
[spark] branch master updated (d75a11d -> a5df5ff)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from d75a11d [SPARK-27330][SS] support task abort in foreach writer add a5df5ff [SPARK-28734][DOC] Initial table of content in the left hand side bar for SQL doc No new revisions were added by this update. Summary of changes: docs/_data/menu-sql.yaml | 164 +- docs/css/main.css | 5 +- docs/sql-ref-arithmetic-ops.md| 22 +++ docs/{sql-reference.md => sql-ref-datatypes.md} | 28 +--- docs/sql-ref-functions-builtin-aggregate.md | 22 +++ docs/sql-ref-functions-builtin-scalar.md | 22 +++ docs/sql-ref-functions-builtin.md | 25 docs/sql-ref-functions-udf-aggregate.md | 22 +++ docs/sql-ref-functions-udf-scalar.md | 22 +++ docs/sql-ref-functions-udf.md | 25 docs/sql-ref-functions.md | 25 docs/sql-ref-nan-semantics.md | 29 docs/sql-ref-syntax-aux-analyze-table.md | 22 +++ docs/sql-ref-syntax-aux-analyze.md| 25 docs/sql-ref-syntax-aux-cache-cache-table.md | 22 +++ docs/sql-ref-syntax-aux-cache-clear-cache.md | 22 +++ docs/sql-ref-syntax-aux-cache-uncache-table.md| 22 +++ docs/sql-ref-syntax-aux-cache.md | 25 docs/sql-ref-syntax-aux-conf-mgmt-reset.md| 22 +++ docs/sql-ref-syntax-aux-conf-mgmt-set.md | 22 +++ docs/sql-ref-syntax-aux-conf-mgmt.md | 25 docs/sql-ref-syntax-aux-describe-database.md | 22 +++ docs/sql-ref-syntax-aux-describe-function.md | 22 +++ docs/sql-ref-syntax-aux-describe-query.md | 22 +++ docs/sql-ref-syntax-aux-describe-table.md | 22 +++ docs/sql-ref-syntax-aux-describe.md | 25 docs/sql-ref-syntax-aux-resource-mgmt-add-file.md | 22 +++ docs/sql-ref-syntax-aux-resource-mgmt-add-jar.md | 22 +++ docs/sql-ref-syntax-aux-resource-mgmt.md | 25 docs/sql-ref-syntax-aux-show-columns.md | 22 +++ docs/sql-ref-syntax-aux-show-create-table.md | 22 +++ docs/sql-ref-syntax-aux-show-databases.md | 22 +++ docs/sql-ref-syntax-aux-show-functions.md | 22 +++ docs/sql-ref-syntax-aux-show-partitions.md| 22 +++ docs/sql-ref-syntax-aux-show-table.md | 22 +++ docs/sql-ref-syntax-aux-show-tables.md| 22 +++ docs/sql-ref-syntax-aux-show-tblproperties.md | 22 +++ docs/sql-ref-syntax-aux-show.md | 25 docs/sql-ref-syntax-aux.md| 25 docs/sql-ref-syntax-ddl-alter-database.md | 22 +++ docs/sql-ref-syntax-ddl-alter-table.md| 22 +++ docs/sql-ref-syntax-ddl-alter-view.md | 22 +++ docs/sql-ref-syntax-ddl-create-database.md| 22 +++ docs/sql-ref-syntax-ddl-create-function.md| 22 +++ docs/sql-ref-syntax-ddl-create-table.md | 22 +++ docs/sql-ref-syntax-ddl-create-view.md| 22 +++ docs/sql-ref-syntax-ddl-drop-database.md | 22 +++ docs/sql-ref-syntax-ddl-drop-function.md | 22 +++ docs/sql-ref-syntax-ddl-drop-table.md | 22 +++ docs/sql-ref-syntax-ddl-drop-view.md | 22 +++ docs/sql-ref-syntax-ddl-repair-table.md | 22 +++ docs/sql-ref-syntax-ddl-truncate-table.md | 22 +++ docs/sql-ref-syntax-ddl.md| 25 docs/sql-ref-syntax-dml-insert.md | 22 +++ docs/sql-ref-syntax-dml-load.md | 22 +++ docs/sql-ref-syntax-dml.md| 25 docs/sql-ref-syntax-qry-aggregation.md| 22 +++ docs/sql-ref-syntax-qry-explain.md| 22 +++ docs/sql-ref-syntax-qry-sampling.md | 22 +++ docs/sql-ref-syntax-qry-select-cte.md | 22 +++ docs/sql-ref-syntax-qry-select-distinct.md| 22 +++ docs/sql-ref-syntax-qry-select-groupby.md | 22 +++ docs/sql-ref-syntax-qry-select-having.md | 22 +++ docs/sql-ref-syntax-qry-select-hints.md | 22 +++ docs/sql-ref-syntax-qry-select-join.md| 22 +++ docs/sql-ref-syntax-qry-select-limit.md | 22 +++ docs/sql-ref-syntax-qry-select-orderby.md | 22 +++ 
docs/sql-ref-syntax-qry-select-setops.md | 22 +++ docs/sql-ref-syntax-qry-select-subqueries.md | 22 +++ docs/sql-ref-syntax-qry-select.md | 25 docs/sql-ref-syntax-qry-window.md | 22 +++ docs/sql-ref-syntax-qry.md| 25 docs/sql-ref-syntax.md| 25 docs/sql-ref.md | 25 74 files changed, 1781 inserti
[spark] branch master updated (5756a47 -> efbb035)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 5756a47  [SPARK-28766][R][DOC] Fix CRAN incoming feasibility warning on invalid URL
   add efbb035  [SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server

No new revisions were added by this update.

Summary of changes:
 project/SparkBuild.scala                           | 3 +-
 .../org/apache/spark/sql/SQLQueryTestSuite.scala   | 67 ++--
 sql/hive-thriftserver/pom.xml                      | 7 +
 .../sql/hive/thriftserver/HiveThriftServer2.scala  | 3 +-
 .../thriftserver/ThriftServerQueryTestSuite.scala  | 362 +
 5 files changed, 409 insertions(+), 33 deletions(-)
 create mode 100644 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala
[spark] branch master updated (469423f -> a59fdc4)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 469423f  [SPARK-28595][SQL] explain should not trigger partition listing
   add a59fdc4  [SPARK-28472][SQL][TEST] Add test for thriftserver protocol versions

No new revisions were added by this update.

Summary of changes:
 .../SparkThriftServerProtocolVersionsSuite.scala   | 316 +
 .../hive/thriftserver/ThriftserverShimUtils.scala  | 13 +
 .../hive/thriftserver/ThriftserverShimUtils.scala  | 15 +
 3 files changed, 344 insertions(+)
 create mode 100644 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkThriftServerProtocolVersionsSuite.scala
[spark] branch master updated (150dbc5 -> bab88c4)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 150dbc5  [SPARK-28391][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into groupby clause in 'pgSQL/select_implicit.sql'
   add bab88c4  [SPARK-28622][SQL][PYTHON] Rename PullOutPythonUDFInJoinCondition to ExtractPythonUDFFromJoinCondition and move to 'Extract Python UDFs'

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalyst/expressions/predicates.scala     | 2 +-
 .../scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala  | 6 +-
 .../main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala | 2 +-
 ...ionSuite.scala => ExtractPythonUDFFromJoinConditionSuite.scala} | 4 ++--
 .../main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala | 7 ++-
 5 files changed, 11 insertions(+), 10 deletions(-)
 rename sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/{PullOutPythonUDFInJoinConditionSuite.scala => ExtractPythonUDFFromJoinConditionSuite.scala} (98%)
[spark] branch master updated (8ae032d -> 10d4ffd)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 8ae032d  Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7"
   add 10d4ffd  [SPARK-28532][SPARK-28530][SQL][FOLLOWUP] Inline doc for FixedPoint(1) batches "Subquery" and "Join Reorder"

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala | 4
 1 file changed, 4 insertions(+)
[spark] branch master updated: [SPARK-28510][SQL] Implement Spark's own GetFunctionsOperation
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new efd9299 [SPARK-28510][SQL] Implement Spark's own GetFunctionsOperation efd9299 is described below commit efd92993f403fe40b3abd1dac21f8d7c875f407d Author: Yuming Wang AuthorDate: Fri Aug 2 08:50:42 2019 -0700 [SPARK-28510][SQL] Implement Spark's own GetFunctionsOperation ## What changes were proposed in this pull request? This PR implements Spark's own GetFunctionsOperation which mitigates the differences between Spark SQL and Hive UDFs. But our implementation is different from Hive's implementation: - Our implementation always returns results. Hive only returns results when [(null == catalogName || "".equals(catalogName)) && (null == schemaName || "".equals(schemaName))](https://github.com/apache/hive/blob/rel/release-3.1.1/service/src/java/org/apache/hive/service/cli/operation/GetFunctionsOperation.java#L101-L119). - Our implementation pads the `REMARKS` field with the function usage - Hive returns an empty string. - Our implementation does not support `FUNCTION_TYPE`, but Hive does. ## How was this patch tested? unit tests Closes #25252 from wangyum/SPARK-28510. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../thriftserver/SparkGetFunctionsOperation.scala | 115 + .../server/SparkSQLOperationManager.scala | 15 +++ .../thriftserver/SparkMetadataOperationSuite.scala | 43 +++- .../cli/operation/GetFunctionsOperation.java | 2 +- .../cli/operation/GetFunctionsOperation.java | 2 +- 5 files changed, 174 insertions(+), 3 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetFunctionsOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetFunctionsOperation.scala new file mode 100644 index 000..462e5730 --- /dev/null +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetFunctionsOperation.scala @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hive.thriftserver + +import java.sql.DatabaseMetaData +import java.util.UUID + +import scala.collection.JavaConverters.seqAsJavaListConverter + +import org.apache.hadoop.hive.ql.security.authorization.plugin.{HiveOperationType, HivePrivilegeObjectUtils} +import org.apache.hive.service.cli._ +import org.apache.hive.service.cli.operation.GetFunctionsOperation +import org.apache.hive.service.cli.operation.MetadataOperation.DEFAULT_HIVE_CATALOG +import org.apache.hive.service.cli.session.HiveSession + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.SQLContext +import org.apache.spark.util.{Utils => SparkUtils} + +/** + * Spark's own GetFunctionsOperation + * + * @param sqlContext SQLContext to use + * @param parentSession a HiveSession from SessionManager + * @param catalogName catalog name. null if not applicable + * @param schemaName database name, null or a concrete database name + * @param functionName function name pattern + */ +private[hive] class SparkGetFunctionsOperation( +sqlContext: SQLContext, +parentSession: HiveSession, +catalogName: String, +schemaName: String, +functionName: String) + extends GetFunctionsOperation(parentSession, catalogName, schemaName, functionName) with Logging { + + private var statementId: String = _ + + override def close(): Unit = { +super.close() +HiveThriftServer2.listener.onOperationClosed(statementId) + } + + override def runInternal(): Unit = { +statementId = UUID.randomUUID().toString +// Do not change cmdStr. It's used for Hive auditing and authorization. +val cmdStr = s"catalog : $catalogName, schemaPattern : $schemaName" +val logMsg = s"Listing functions '$cmdStr, functionName : $functionName'" +logInfo
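For context on the GetFunctionsOperation change above, here is a minimal client-side sketch of the JDBC call it serves. The connection URL, credentials, and the assumption that a Thrift Server is running locally with the Hive JDBC driver on the classpath are illustrative, not part of the commit:

```scala
import java.sql.DriverManager

// Hypothetical Thrift Server address and credentials.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  // DatabaseMetaData.getFunctions is the call answered by SparkGetFunctionsOperation.
  val rs = conn.getMetaData.getFunctions(null, null, "%")
  while (rs.next()) {
    // Per the commit above, Spark pads REMARKS with the function usage string.
    println(s"${rs.getString("FUNCTION_NAME")} -> ${rs.getString("REMARKS")}")
  }
} finally {
  conn.close()
}
```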
[spark] branch master updated: [SPARK-28375][SQL] Make pullupCorrelatedPredicate idempotent
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ee3c1c7 [SPARK-28375][SQL] Make pullupCorrelatedPredicate idempotent ee3c1c7 is described below commit ee3c1c777ddb8034d62213a5d8e064b97cc067e5 Author: Dilip Biswal AuthorDate: Tue Jul 30 16:29:24 2019 -0700 [SPARK-28375][SQL] Make pullupCorrelatedPredicate idempotent ## What changes were proposed in this pull request? This PR makes the optimizer rule PullupCorrelatedPredicates idempotent. ## How was this patch tested? A new test PullupCorrelatedPredicatesSuite Closes #25268 from dilipbiswal/pr-25164. Authored-by: Dilip Biswal Signed-off-by: gatorsmile --- .../spark/sql/catalyst/optimizer/Optimizer.scala | 4 +- .../spark/sql/catalyst/optimizer/subquery.scala| 20 +-- .../PullupCorrelatedPredicatesSuite.scala | 62 +++--- 3 files changed, 72 insertions(+), 14 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 670fc92..346b2e6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -48,9 +48,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) } override protected val blacklistedOnceBatches: Set[String] = -Set("Pullup Correlated Expressions", - "Extract Python UDFs" -) +Set("Extract Python UDFs") protected def fixedPoint = FixedPoint(SQLConf.get.optimizerMaxIterations) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala index 4f7333c..32dbd389 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala @@ -273,16 +273,28 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper } private def rewriteSubQueries(plan: LogicalPlan, outerPlans: Seq[LogicalPlan]): LogicalPlan = { +/** + * This function is used as a aid to enforce idempotency of pullUpCorrelatedPredicate rule. + * In the first call to rewriteSubqueries, all the outer references from the subplan are + * pulled up and join predicates are recorded as children of the enclosing subquery expression. + * The subsequent call to rewriteSubqueries would simply re-records the `children` which would + * contains the pulled up correlated predicates (from the previous call) in the enclosing + * subquery expression. 
+ */ +def getJoinCondition(newCond: Seq[Expression], oldCond: Seq[Expression]): Seq[Expression] = { + if (newCond.isEmpty) oldCond else newCond +} + plan transformExpressions { case ScalarSubquery(sub, children, exprId) if children.nonEmpty => val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans) -ScalarSubquery(newPlan, newCond, exprId) +ScalarSubquery(newPlan, getJoinCondition(newCond, children), exprId) case Exists(sub, children, exprId) if children.nonEmpty => val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans) -Exists(newPlan, newCond, exprId) - case ListQuery(sub, _, exprId, childOutputs) => +Exists(newPlan, getJoinCondition(newCond, children), exprId) + case ListQuery(sub, children, exprId, childOutputs) if children.nonEmpty => val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans) -ListQuery(newPlan, newCond, exprId, childOutputs) +ListQuery(newPlan, getJoinCondition(newCond, children), exprId, childOutputs) } } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullupCorrelatedPredicatesSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullupCorrelatedPredicatesSuite.scala index 960162a..2d86d5a 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullupCorrelatedPredicatesSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullupCorrelatedPredicatesSuite.scala @@ -19,9 +19,9 @@ package org.apache.spark.sql.catalyst.optimizer import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ -import org.apache.spark.sql.catalyst.expressions.{InSubquery, ListQuery} +import org.apache.spark.sql.catalyst.expressions._ import org.apache
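A rough way to picture the idempotence being asserted here: applying the rule to its own output should be a no-op. The helper below is an illustrative sketch, not the test code added in PullupCorrelatedPredicatesSuite (Spark's own tests compare plans with normalized expression IDs):

```scala
import org.apache.spark.sql.catalyst.optimizer.PullupCorrelatedPredicates
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch only: a second application of the rule must not change the plan further.
def assertPullupIsIdempotent(plan: LogicalPlan): Unit = {
  val once  = PullupCorrelatedPredicates(plan)
  val twice = PullupCorrelatedPredicates(once)
  assert(once == twice, "PullupCorrelatedPredicates rewrote an already-rewritten plan")
}
```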
[spark] branch master updated: [SPARK-28530][SQL] Cost-based join reorder optimizer batch should be FixedPoint(1)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d4e2466 [SPARK-28530][SQL] Cost-based join reorder optimizer batch should be FixedPoint(1) d4e2466 is described below commit d4e246658a308e2546d89e7ec9bbd9b6ca7517f7 Author: Yesheng Ma AuthorDate: Fri Jul 26 22:57:39 2019 -0700 [SPARK-28530][SQL] Cost-based join reorder optimizer batch should be FixedPoint(1) ## What changes were proposed in this pull request? Since for AQP the cost for joins can change between multiple runs, there is no reason that we have an idempotence enforcement on this optimizer batch. We thus make it `FixedPoint(1)` instead of `Once`. ## How was this patch tested? Existing UTs. Closes #25266 from yeshengm/SPARK-28530. Lead-authored-by: Yesheng Ma Co-authored-by: Xiao Li Signed-off-by: gatorsmile --- .../main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala | 3 +-- .../org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 3efc41d..670fc92 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -49,7 +49,6 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) override protected val blacklistedOnceBatches: Set[String] = Set("Pullup Correlated Expressions", - "Join Reorder", "Extract Python UDFs" ) @@ -168,7 +167,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) RemoveLiteralFromGroupExpressions, RemoveRepetitionFromGroupExpressions) :: Nil ++ operatorOptimizationBatch) :+ -Batch("Join Reorder", Once, +Batch("Join Reorder", FixedPoint(1), CostBasedJoinReorder) :+ Batch("Remove Redundant Sorts", Once, RemoveRedundantSorts) :+ diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala index 43e5bad..f775500 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala @@ -40,7 +40,7 @@ class JoinReorderSuite extends PlanTest with StatsEstimationTestBase { PushPredicateThroughJoin, ColumnPruning, CollapseProject) :: - Batch("Join Reorder", Once, + Batch("Join Reorder", FixedPoint(1), CostBasedJoinReorder) :: Nil } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28532][SQL] Make optimizer batch "subquery" FixedPoint(1)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e037a11 [SPARK-28532][SQL] Make optimizer batch "subquery" FixedPoint(1) e037a11 is described below commit e037a11494b8079d9440899a9e509bb6760bcc43 Author: Yesheng Ma AuthorDate: Fri Jul 26 22:48:42 2019 -0700 [SPARK-28532][SQL] Make optimizer batch "subquery" FixedPoint(1) ## What changes were proposed in this pull request? In the Catalyst optimizer, the batch subquery actually calls the optimizer recursively. Therefore it makes no sense to enforce idempotence on it and we change this batch to `FixedPoint(1)`. ## How was this patch tested? Existing UTs. Closes #25267 from yeshengm/SPARK-28532. Authored-by: Yesheng Ma Signed-off-by: gatorsmile --- .../main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 1c36cdc..3efc41d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -50,7 +50,6 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) override protected val blacklistedOnceBatches: Set[String] = Set("Pullup Correlated Expressions", "Join Reorder", - "Subquery", "Extract Python UDFs" ) @@ -156,7 +155,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) PropagateEmptyRelation) :: Batch("Pullup Correlated Expressions", Once, PullupCorrelatedPredicates) :: -Batch("Subquery", Once, +Batch("Subquery", FixedPoint(1), OptimizeSubqueries) :: Batch("Replace Operators", fixedPoint, RewriteExceptAll, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
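To make the distinction these two commits rely on concrete: `Once` and `FixedPoint(1)` both run a batch a single time, but only `Once` batches carry the idempotence expectation enforced by SPARK-28237 (see further below), while `FixedPoint(1)` is the escape hatch for batches whose output can legitimately differ across applications. An illustrative toy executor, not the actual optimizer code:

```scala
import org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Illustrative only: the two strategies side by side in a toy executor.
object StrategyDemo extends RuleExecutor[LogicalPlan] {
  override protected def batches: Seq[Batch] = Seq(
    // Runs once and is expected (and checked) to be idempotent.
    Batch("Some once batch", Once, CostBasedJoinReorder),
    // Also runs only once, but without the idempotence expectation.
    Batch("Join Reorder", FixedPoint(1), CostBasedJoinReorder)
  )
}
```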
[spark] branch master updated: [SPARK-28463][SQL] Thriftserver throws BigDecimal incompatible with HiveDecimal
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 545c7ee [SPARK-28463][SQL] Thriftserver throws BigDecimal incompatible with HiveDecimal 545c7ee is described below commit 545c7ee00b5d4c5b848be6b27b3820955a0803d6 Author: Yuming Wang AuthorDate: Fri Jul 26 10:30:01 2019 -0700 [SPARK-28463][SQL] Thriftserver throws BigDecimal incompatible with HiveDecimal ## What changes were proposed in this pull request? How to reproduce this issue: ```shell build/sbt clean package -Phive -Phive-thriftserver -Phadoop-3.2 export SPARK_PREPEND_CLASSES=true sbin/start-thriftserver.sh [rootspark-3267648 spark]# bin/beeline -u jdbc:hive2://localhost:1/default -e "select cast(1 as decimal(38, 18));" Connecting to jdbc:hive2://localhost:1/default Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 2.3.5) Transaction isolation: TRANSACTION_REPEATABLE_READ Error: java.lang.ClassCastException: java.math.BigDecimal incompatible with org.apache.hadoop.hive.common.type.HiveDecimal (state=,code=0) Closing: 0: jdbc:hive2://localhost:1/default ``` This pr fix this issue. ## How was this patch tested? unit tests Closes #25217 from wangyum/SPARK-28463. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala| 8 .../main/java/org/apache/hive/service/cli/ColumnBasedSet.java| 9 + 2 files changed, 9 insertions(+), 8 deletions(-) diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala index dd18add..9c53e90 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala @@ -654,6 +654,14 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { assert(resultSet.getString(1) === "4.56") } } + + test("SPARK-28463: Thriftserver throws BigDecimal incompatible with HiveDecimal") { +withJdbcStatement() { statement => + val rs = statement.executeQuery("SELECT CAST(1 AS decimal(38, 18))") + assert(rs.next()) + assert(rs.getBigDecimal(1) === new java.math.BigDecimal("1.00")) +} + } } class SingleSessionSuite extends HiveThriftJdbcTest { diff --git a/sql/hive-thriftserver/v2.3.5/src/main/java/org/apache/hive/service/cli/ColumnBasedSet.java b/sql/hive-thriftserver/v2.3.5/src/main/java/org/apache/hive/service/cli/ColumnBasedSet.java index 5546060..3ca18f0 100644 --- a/sql/hive-thriftserver/v2.3.5/src/main/java/org/apache/hive/service/cli/ColumnBasedSet.java +++ b/sql/hive-thriftserver/v2.3.5/src/main/java/org/apache/hive/service/cli/ColumnBasedSet.java @@ -23,9 +23,7 @@ import java.util.ArrayList; import java.util.Iterator; import java.util.List; -import org.apache.hadoop.hive.common.type.HiveDecimal; import org.apache.hadoop.hive.serde2.thrift.ColumnBuffer; -import org.apache.hadoop.hive.serde2.thrift.Type; import org.apache.hive.service.rpc.thrift.TColumn; import org.apache.hive.service.rpc.thrift.TRow; import org.apache.hive.service.rpc.thrift.TRowSet; @@ -105,12 +103,7 @@ public class ColumnBasedSet implements RowSet { } else { for (int i = 0; i < fields.length; i++) { TypeDescriptor descriptor = descriptors[i]; -Object field 
= fields[i]; -if (field != null && descriptor.getType() == Type.DECIMAL_TYPE) { - int scale = descriptor.getDecimalDigits(); - field = ((HiveDecimal) field).toFormatString(scale); -} -columns.get(i).addValue(descriptor.getType(), field); +columns.get(i).addValue(descriptor.getType(), fields[i]); } } return this; - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (bf41070 -> 6807a82)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from bf41070 [SPARK-28499][ML] Optimize MinMaxScaler add 6807a82 [SPARK-28524][SQL] Fix ThriftServerTab lost error message No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala| 2 +- .../apache/spark/sql/hive/thriftserver/ui/ThriftServerSessionPage.scala | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (ded1a74 -> c93d2dd)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from ded1a74 [SPARK-28365][ML] Fallback locale to en_US in StopWordsRemover if system default locale isn't in available locales in JVM add c93d2dd [SPARK-28237][SQL] Enforce Idempotence for Once batches in RuleExecutor No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/optimizer/Optimizer.scala | 7 + .../spark/sql/catalyst/rules/RuleExecutor.scala| 30 -- .../sql/catalyst/optimizer/PruneFiltersSuite.scala | 2 +- .../PullupCorrelatedPredicatesSuite.scala | 2 ++ .../optimizer/StarJoinCostBasedReorderSuite.scala | 2 +- .../sql/catalyst/trees/RuleExecutorSuite.scala | 22 +--- 6 files changed, 58 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
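The enforcement referenced here can be pictured roughly as follows: after a `Once` batch finishes, its rules are applied one more time and the result must be unchanged, unless the batch is explicitly excluded via the optimizer's `blacklistedOnceBatches` set. This is a simplified sketch of the idea, not the actual RuleExecutor code:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only: re-run the batch's rules on its own output and demand a fixed point.
def checkOnceBatchIdempotence(
    batchName: String,
    rules: Seq[Rule[LogicalPlan]],
    result: LogicalPlan,
    excludedBatches: Set[String]): Unit = {
  if (!excludedBatches.contains(batchName)) {
    val reRun = rules.foldLeft(result) { case (plan, rule) => rule(plan) }
    if (reRun != result) {
      throw new RuntimeException(s"Once batch '$batchName' is not idempotent")
    }
  }
}
```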
[spark] branch master updated (167fa04 -> 045191e)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 167fa04 [SPARK-28390][SQL][PYTHON][TESTS] Convert and port 'pgSQL/select_having.sql' into UDF test base add 045191e [SPARK-28293][SQL] Implement Spark's own GetTableTypesOperation No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/catalog/interface.scala | 2 ++ ...ion.scala => SparkGetTableTypesOperation.scala} | 35 +++--- .../thriftserver/SparkGetTablesOperation.scala | 9 +- .../thriftserver/SparkMetadataOperationUtils.scala | 18 ++- .../server/SparkSQLOperationManager.scala | 17 +-- .../thriftserver/SparkMetadataOperationSuite.scala | 16 ++ .../cli/operation/GetTableTypesOperation.java | 2 +- .../cli/operation/GetTableTypesOperation.java | 2 +- 8 files changed, 56 insertions(+), 45 deletions(-) copy sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/{SparkGetSchemasOperation.scala => SparkGetTableTypesOperation.scala} (64%) copy common/network-common/src/main/java/org/apache/spark/network/protocol/AbstractResponseMessage.java => sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkMetadataOperationUtils.scala (61%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
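Like the other Spark-owned metadata operations in this series, GetTableTypesOperation backs a standard JDBC metadata call; a minimal client-side sketch, with the connection details assumed rather than taken from the commit:

```scala
import java.sql.DriverManager

// Hypothetical Thrift Server address and credentials.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  // Answered by SparkGetTableTypesOperation after this change.
  val rs = conn.getMetaData.getTableTypes
  while (rs.next()) {
    println(rs.getString("TABLE_TYPE")) // one row per table type Spark reports
  }
} finally {
  conn.close()
}
```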
[spark] branch master updated (022667c -> e04f696)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 022667c [SPARK-28469][SQL] Change CalendarIntervalType's readable string representation from calendarinterval to interval add e04f696 [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/trees/TreeNode.scala | 13 +++-- .../spark/sql/execution/QueryExecution.scala | 24 +- .../sql/execution/columnar/InMemoryRelation.scala | 7 +++ .../spark/sql/execution/command/SetCommand.scala | 2 +- .../datasources/SaveIntoDataSourceCommand.scala| 6 +++ .../spark/sql/execution/QueryExecutionSuite.scala | 55 +- .../columnar/PartitionBatchPruningSuite.scala | 4 +- 7 files changed, 93 insertions(+), 18 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: Revert "[SPARK-27485] EnsureRequirements.reorder should handle duplicate expressions gracefully"
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 63898cb Revert "[SPARK-27485] EnsureRequirements.reorder should handle duplicate expressions gracefully" 63898cb is described below commit 63898cbc1db46be8bcbb46d21fe01340bc883520 Author: gatorsmile AuthorDate: Tue Jul 16 06:57:14 2019 -0700 Revert "[SPARK-27485] EnsureRequirements.reorder should handle duplicate expressions gracefully" This reverts commit 72f547d4a960ba0ba9cace53a0a5553eca1b4dd6. --- .../execution/exchange/EnsureRequirements.scala| 72 ++ .../scala/org/apache/spark/sql/JoinSuite.scala | 20 -- .../apache/spark/sql/execution/PlannerSuite.scala | 26 3 files changed, 32 insertions(+), 86 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala index bdb9a31..d2d5011 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala @@ -24,7 +24,8 @@ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.physical._ import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.joins.{ShuffledHashJoinExec, SortMergeJoinExec} +import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, ShuffledHashJoinExec, + SortMergeJoinExec} import org.apache.spark.sql.internal.SQLConf /** @@ -220,41 +221,25 @@ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] { } private def reorder( - leftKeys: IndexedSeq[Expression], - rightKeys: IndexedSeq[Expression], + leftKeys: Seq[Expression], + rightKeys: Seq[Expression], expectedOrderOfKeys: Seq[Expression], currentOrderOfKeys: Seq[Expression]): (Seq[Expression], Seq[Expression]) = { -if (expectedOrderOfKeys.size != currentOrderOfKeys.size) { - return (leftKeys, rightKeys) -} - -// Build a lookup between an expression and the positions its holds in the current key seq. -val keyToIndexMap = mutable.Map.empty[Expression, mutable.BitSet] -currentOrderOfKeys.zipWithIndex.foreach { - case (key, index) => -keyToIndexMap.getOrElseUpdate(key.canonicalized, mutable.BitSet.empty).add(index) -} - -// Reorder the keys. -val leftKeysBuffer = new ArrayBuffer[Expression](leftKeys.size) -val rightKeysBuffer = new ArrayBuffer[Expression](rightKeys.size) -val iterator = expectedOrderOfKeys.iterator -while (iterator.hasNext) { - // Lookup the current index of this key. - keyToIndexMap.get(iterator.next().canonicalized) match { -case Some(indices) if indices.nonEmpty => - // Take the first available index from the map. - val index = indices.firstKey - indices.remove(index) +val leftKeysBuffer = ArrayBuffer[Expression]() +val rightKeysBuffer = ArrayBuffer[Expression]() +val pickedIndexes = mutable.Set[Int]() +val keysAndIndexes = currentOrderOfKeys.zipWithIndex - // Add the keys for that index to the reordered keys. - leftKeysBuffer += leftKeys(index) - rightKeysBuffer += rightKeys(index) -case _ => - // The expression cannot be found, or we have exhausted all indices for that expression. 
- return (leftKeys, rightKeys) - } -} +expectedOrderOfKeys.foreach(expression => { + val index = keysAndIndexes.find { case (e, idx) => +// As we may have the same key used many times, we need to filter out its occurrence we +// have already used. +e.semanticEquals(expression) && !pickedIndexes.contains(idx) + }.map(_._2).get + pickedIndexes += index + leftKeysBuffer.append(leftKeys(index)) + rightKeysBuffer.append(rightKeys(index)) +}) (leftKeysBuffer, rightKeysBuffer) } @@ -264,13 +249,20 @@ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] { leftPartitioning: Partitioning, rightPartitioning: Partitioning): (Seq[Expression], Seq[Expression]) = { if (leftKeys.forall(_.deterministic) && rightKeys.forall(_.deterministic)) { - (leftPartitioning, rightPartitioning) match { -case (HashPartitioning(leftExpressions, _), _) => - reorder(leftKeys.toIndexedSeq, rightKeys.toIndexedSeq, leftExpressions, leftKeys) -case (_, HashPartitioning(rightExpressions, _)) => - reorder(leftKeys.toIndexedSeq, rightKeys.toIndexedSeq, rightExpressions, righ
[spark] branch master updated (8d1e87a -> 2f3997f)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 8d1e87a [SPARK-28150][CORE][FOLLOW-UP] Don't try to log in when impersonating. add 2f3997f [SPARK-28306][SQL][FOLLOWUP] Fix NormalizeFloatingNumbers rule idempotence for equi-join with `<=>` predicates No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/planning/patterns.scala| 16 ++-- .../optimizer/NormalizeFloatingPointNumbersSuite.scala | 15 ++- 2 files changed, 24 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
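For reference, the query shape this follow-up targets is a null-safe (`<=>`) equi-join on floating-point keys, where NormalizeFloatingNumbers normalizes the extracted join keys. A self-contained toy example (data and names invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("nullsafe-join-demo").getOrCreate()
import spark.implicits._

// NaN and -0.0 are exactly the values the normalization rule cares about.
Seq(1.0, Double.NaN, -0.0).toDF("reading").createOrReplaceTempView("a")
Seq(1.0, Double.NaN, 0.0).toDF("reading").createOrReplaceTempView("b")

// <=> is planned as an equi-join key; this fix keeps the resulting rewrite idempotent.
spark.sql("SELECT * FROM a JOIN b ON a.reading <=> b.reading").show()
```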
[spark] branch master updated: [SPARK-28355][CORE][PYTHON] Use Spark conf for threshold at which command is compressed by broadcast
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 79e2047 [SPARK-28355][CORE][PYTHON] Use Spark conf for threshold at which command is compressed by broadcast 79e2047 is described below commit 79e204770300dab4a669b9f8e2421ef905236e7b Author: Jesse Cai AuthorDate: Sat Jul 13 08:44:16 2019 -0700 [SPARK-28355][CORE][PYTHON] Use Spark conf for threshold at which command is compressed by broadcast ## What changes were proposed in this pull request? The `_prepare_for_python_RDD` method currently broadcasts a pickled command if its length is greater than the hardcoded value `1 << 20` (1M). This change sets this value as a Spark conf instead. ## How was this patch tested? Unit tests, manual tests. Closes #25123 from jessecai/SPARK-28355. Authored-by: Jesse Cai Signed-off-by: gatorsmile --- .../scala/org/apache/spark/api/python/PythonUtils.scala | 4 .../scala/org/apache/spark/internal/config/package.scala| 8 core/src/test/scala/org/apache/spark/SparkConfSuite.scala | 13 + python/pyspark/rdd.py | 2 +- 4 files changed, 26 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala index eee6e4b..62d6047 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala @@ -81,4 +81,8 @@ private[spark] object PythonUtils { def isEncryptionEnabled(sc: JavaSparkContext): Boolean = { sc.conf.get(org.apache.spark.internal.config.IO_ENCRYPTION_ENABLED) } + + def getBroadcastThreshold(sc: JavaSparkContext): Long = { + sc.conf.get(org.apache.spark.internal.config.BROADCAST_FOR_UDF_COMPRESSION_THRESHOLD) + } } diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index 46f..76d3d6e 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -1246,6 +1246,14 @@ package object config { "mechanisms to guarantee data won't be corrupted during broadcast") .booleanConf.createWithDefault(true) + private[spark] val BROADCAST_FOR_UDF_COMPRESSION_THRESHOLD = +ConfigBuilder("spark.broadcast.UDFCompressionThreshold") + .doc("The threshold at which user-defined functions (UDFs) and Python RDD commands " + +"are compressed by broadcast in bytes unless otherwise specified") + .bytesConf(ByteUnit.BYTE) + .checkValue(v => v >= 0, "The threshold should be non-negative.") + .createWithDefault(1L * 1024 * 1024) + private[spark] val RDD_COMPRESS = ConfigBuilder("spark.rdd.compress") .doc("Whether to compress serialized RDD partitions " + "(e.g. 
for StorageLevel.MEMORY_ONLY_SER in Scala " + diff --git a/core/src/test/scala/org/apache/spark/SparkConfSuite.scala b/core/src/test/scala/org/apache/spark/SparkConfSuite.scala index 6be1fed..202b85d 100644 --- a/core/src/test/scala/org/apache/spark/SparkConfSuite.scala +++ b/core/src/test/scala/org/apache/spark/SparkConfSuite.scala @@ -389,6 +389,19 @@ class SparkConfSuite extends SparkFunSuite with LocalSparkContext with ResetSyst """.stripMargin.trim) } + test("SPARK-28355: Use Spark conf for threshold at which UDFs are compressed by broadcast") { +val conf = new SparkConf() + +// Check the default value +assert(conf.get(BROADCAST_FOR_UDF_COMPRESSION_THRESHOLD) === 1L * 1024 * 1024) + +// Set the conf +conf.set(BROADCAST_FOR_UDF_COMPRESSION_THRESHOLD, 1L * 1024) + +// Verify that it has been set properly +assert(conf.get(BROADCAST_FOR_UDF_COMPRESSION_THRESHOLD) === 1L * 1024) + } + val defaultIllegalValue = "SomeIllegalValue" val illegalValueTests : Map[String, (SparkConf, String) => Any] = Map( "getTimeAsSeconds" -> (_.getTimeAsSeconds(_)), diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py index 8bcc67a..96fdf5f 100644 --- a/python/pyspark/rdd.py +++ b/python/pyspark/rdd.py @@ -2490,7 +2490,7 @@ def _prepare_for_python_RDD(sc, command): # the serialized command will be compressed by broadcast ser = CloudPickleSerializer() pickled_command = ser.dumps(command) -if len(pickled_command) > (1 << 20): # 1M +if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc): # Default 1M
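A small usage sketch of the new configuration; the 4 MB value and application name are arbitrary choices for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Raise the size at which pickled commands/UDFs are shipped via broadcast
// from the 1M default to 4M (the conf is interpreted in bytes).
val conf = new SparkConf()
  .setAppName("udf-broadcast-threshold-demo")
  .setMaster("local[*]")
  .set("spark.broadcast.UDFCompressionThreshold", (4 * 1024 * 1024).toString)

val spark = SparkSession.builder().config(conf).getOrCreate()
```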
[spark] branch master updated: [SPARK-28260][SQL] Add CLOSED state to ExecutionState
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 687dd4e [SPARK-28260][SQL] Add CLOSED state to ExecutionState 687dd4e is described below commit 687dd4eb55739f802692b3c5457618fd6558e538 Author: Yuming Wang AuthorDate: Fri Jul 12 10:31:28 2019 -0700 [SPARK-28260][SQL] Add CLOSED state to ExecutionState ## What changes were proposed in this pull request? The `ThriftServerTab` displays a FINISHED state when the operation finishes execution, but quite often it still takes a lot of time to fetch the results. OperationState has state CLOSED for when after the iterator is closed. This PR add CLOSED state to ExecutionState, and override the `close()` in SparkExecuteStatementOperation, SparkGetColumnsOperation, SparkGetSchemasOperation and SparkGetTablesOperation. ## How was this patch tested? manual tests 1. Add `Thread.sleep(1)` before [SparkExecuteStatementOperation.scala#L112](https://github.com/apache/spark/blob/b2e7677f4d3d8f47f5f148680af39d38f2b558f0/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L112) 2. Switch to `ThriftServerTab`: ![image](https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png) 3. After a while: ![image](https://user-images.githubusercontent.com/5399861/60809719-e850a180-a1bd-11e9-9a6a-546146e626ab.png) Closes #25062 from wangyum/SPARK-28260. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../spark/sql/hive/thriftserver/HiveThriftServer2.scala| 14 ++ .../hive/thriftserver/SparkExecuteStatementOperation.scala | 3 ++- .../sql/hive/thriftserver/SparkGetColumnsOperation.scala | 9 - .../sql/hive/thriftserver/SparkGetSchemasOperation.scala | 9 - .../sql/hive/thriftserver/SparkGetTablesOperation.scala| 9 - .../spark/sql/hive/thriftserver/ui/ThriftServerPage.scala | 8 +--- .../sql/hive/thriftserver/ui/ThriftServerSessionPage.scala | 8 +--- 7 files changed, 46 insertions(+), 14 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala index d1de9f0..b4d1d0d 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala @@ -137,7 +137,7 @@ object HiveThriftServer2 extends Logging { } private[thriftserver] object ExecutionState extends Enumeration { -val STARTED, COMPILED, FAILED, FINISHED = Value +val STARTED, COMPILED, FAILED, FINISHED, CLOSED = Value type ExecutionState = Value } @@ -147,16 +147,17 @@ object HiveThriftServer2 extends Logging { val startTimestamp: Long, val userName: String) { var finishTimestamp: Long = 0L +var closeTimestamp: Long = 0L var executePlan: String = "" var detail: String = "" var state: ExecutionState.Value = ExecutionState.STARTED val jobId: ArrayBuffer[String] = ArrayBuffer[String]() var groupId: String = "" -def totalTime: Long = { - if (finishTimestamp == 0L) { +def totalTime(endTime: Long): Long = { + if (endTime == 0L) { System.currentTimeMillis - startTimestamp } else { -finishTimestamp - startTimestamp +endTime - startTimestamp } } } @@ -254,6 +255,11 @@ object HiveThriftServer2 extends Logging { 
trimExecutionIfNecessary() } +def onOperationClosed(id: String): Unit = synchronized { + executionList(id).closeTimestamp = System.currentTimeMillis + executionList(id).state = ExecutionState.CLOSED +} + private def trimExecutionIfNecessary() = { if (executionList.size > retainedStatements) { val toRemove = math.max(retainedStatements / 10, 1) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala index 820f76d..2f011c2 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala @@ -70,11 +70,12 @@ private[hive] class SparkExecuteStatementOperation( } } - def close(): Unit = { + override def close(): Unit = { // RDDs will be cleaned automatically upon
[spark] branch master updated: [SPARK-27815][SQL] Predicate pushdown in one pass for cascading joins
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 74f1176 [SPARK-27815][SQL] Predicate pushdown in one pass for cascading joins 74f1176 is described below commit 74f1176311676145d5d8669daacf67e5308f68b5 Author: Yesheng Ma AuthorDate: Wed Jul 3 09:01:16 2019 -0700 [SPARK-27815][SQL] Predicate pushdown in one pass for cascading joins ## What changes were proposed in this pull request? This PR makes the predicate pushdown logic in catalyst optimizer more efficient by unifying two existing rules `PushdownPredicates` and `PushPredicateThroughJoin`. Previously pushing down a predicate for queries such as `Filter(Join(Join(Join)))` requires n steps. This patch essentially reduces this to a single pass. To make this actually work, we need to unify a few rules such as `CombineFilters`, `PushDownPredicate` and `PushDownPrdicateThroughJoin`. Otherwise cases such as `Filter(Join(Filter(Join)))` still requires several passes to fully push down predicates. This unification is done by composing several partial functions, which makes a minimal code change and can reuse existing UTs. Results show that this optimization can improve the catalyst optimization time by 16.5%. For queries with more joins, the performance is even better. E.g., for TPC-DS q64, the performance boost is 49.2%. ## How was this patch tested? Existing UTs + new a UT for the new rule. Closes #24956 from yeshengm/fixed-point-opt. Authored-by: Yesheng Ma Signed-off-by: gatorsmile --- .../spark/sql/catalyst/optimizer/Optimizer.scala | 30 +++- .../optimizer/PushDownLeftSemiAntiJoin.scala | 10 +- .../catalyst/optimizer/ColumnPruningSuite.scala| 2 +- .../optimizer/FilterPushdownOnePassSuite.scala | 183 + .../catalyst/optimizer/FilterPushdownSuite.scala | 2 +- .../InferFiltersFromConstraintsSuite.scala | 2 +- .../catalyst/optimizer/JoinOptimizationSuite.scala | 2 +- .../sql/catalyst/optimizer/JoinReorderSuite.scala | 2 +- .../optimizer/LeftSemiAntiJoinPushDownSuite.scala | 2 +- .../catalyst/optimizer/OptimizerLoggingSuite.scala | 12 +- .../optimizer/OptimizerRuleExclusionSuite.scala| 6 +- .../optimizer/PropagateEmptyRelationSuite.scala| 4 +- .../sql/catalyst/optimizer/PruneFiltersSuite.scala | 2 +- .../sql/catalyst/optimizer/SetOperationSuite.scala | 2 +- .../optimizer/StarJoinCostBasedReorderSuite.scala | 2 +- .../catalyst/optimizer/StarJoinReorderSuite.scala | 2 +- .../spark/sql/execution/SparkOptimizer.scala | 4 +- 17 files changed, 235 insertions(+), 34 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 17b4ff7..c99d2c0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -63,8 +63,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) PushProjectionThroughUnion, ReorderJoin, EliminateOuterJoin, -PushPredicateThroughJoin, -PushDownPredicate, +PushDownPredicates, PushDownLeftSemiAntiJoin, PushLeftSemiLeftAntiThroughJoin, LimitPushDown, @@ -911,7 +910,9 @@ object CombineUnions extends Rule[LogicalPlan] { * one conjunctive predicate. 
*/ object CombineFilters extends Rule[LogicalPlan] with PredicateHelper { - def apply(plan: LogicalPlan): LogicalPlan = plan transform { + def apply(plan: LogicalPlan): LogicalPlan = plan transform applyLocally + + val applyLocally: PartialFunction[LogicalPlan, LogicalPlan] = { // The query execution/optimization does not guarantee the expressions are evaluated in order. // We only can combine them if and only if both are deterministic. case Filter(fc, nf @ Filter(nc, grandChild)) if fc.deterministic && nc.deterministic => @@ -997,14 +998,29 @@ object PruneFilters extends Rule[LogicalPlan] with PredicateHelper { } /** + * The unified version for predicate pushdown of normal operators and joins. + * This rule improves performance of predicate pushdown for cascading joins such as: + * Filter-Join-Join-Join. Most predicates can be pushed down in a single pass. + */ +object PushDownPredicates extends Rule[LogicalPlan] with PredicateHelper { + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +CombineFilters.applyLocally + .orElse(PushPredicateThroughNonJoin.applyLocally) + .orElse(PushPredicateThroughJoin.applyLocally) + } +} + +/** * Pushes [[Filter]] operators through many operators iff:
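The unification leans on composing each rule's `applyLocally` partial function with `orElse`. A tiny standalone illustration of that Scala idiom, using strings as stand-ins for plan nodes (this is not Spark code):

```scala
// Each "rule" only handles the plan shapes it knows about.
val combineFilters: PartialFunction[String, String] = {
  case "Filter(Filter(x))" => "Filter(x)"
}
val pushThroughJoin: PartialFunction[String, String] = {
  case "Filter(Join(x))" => "Join(Filter(x))"
}

// orElse yields one partial function that tries each case in order,
// mirroring how PushDownPredicates chains the per-rule applyLocally blocks.
val pushDown = combineFilters.orElse(pushThroughJoin)

assert(pushDown("Filter(Join(x))") == "Join(Filter(x))")
```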
[spark] branch master updated: [SPARK-28167][SQL] Show global temporary view in database tool
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ea03030 [SPARK-28167][SQL] Show global temporary view in database tool ea03030 is described below commit ea0303063f5d0e25dbc6c126b13bbefe59a8d1f7 Author: Yuming Wang AuthorDate: Wed Jul 3 00:01:05 2019 -0700 [SPARK-28167][SQL] Show global temporary view in database tool ## What changes were proposed in this pull request? This pr add support show global temporary view and local temporary view in database tool. TODO: Database tools should support show temporary views because it's schema is null. ## How was this patch tested? unit tests and manual tests: ![image](https://user-images.githubusercontent.com/5399861/60392266-a5455d00-9b31-11e9-92c8-88a8e6c2aec3.png) Closes #24972 from wangyum/SPARK-28167. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../thriftserver/SparkGetColumnsOperation.scala| 95 ++ .../thriftserver/SparkGetSchemasOperation.scala| 8 ++ .../thriftserver/SparkGetTablesOperation.scala | 55 + .../thriftserver/SparkMetadataOperationSuite.scala | 46 +-- 4 files changed, 148 insertions(+), 56 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala index 4b78e2f..6d3c9fc 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala @@ -17,7 +17,6 @@ package org.apache.spark.sql.hive.thriftserver -import java.util.UUID import java.util.regex.Pattern import scala.collection.JavaConverters.seqAsJavaListConverter @@ -33,6 +32,7 @@ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.catalog.SessionCatalog import org.apache.spark.sql.hive.thriftserver.ThriftserverShimUtils.toJavaSQLType +import org.apache.spark.sql.types.StructType /** * Spark's own SparkGetColumnsOperation @@ -56,8 +56,6 @@ private[hive] class SparkGetColumnsOperation( val catalog: SessionCatalog = sqlContext.sessionState.catalog - private var statementId: String = _ - override def runInternal(): Unit = { val cmdStr = s"catalog : $catalogName, schemaPattern : $schemaName, tablePattern : $tableName" + s", columnName : $columnName" @@ -77,7 +75,7 @@ private[hive] class SparkGetColumnsOperation( } val db2Tabs = catalog.listDatabases(schemaPattern).map { dbName => - (dbName, catalog.listTables(dbName, tablePattern)) + (dbName, catalog.listTables(dbName, tablePattern, includeLocalTempViews = false)) }.toMap if (isAuthV2Enabled) { @@ -88,42 +86,31 @@ private[hive] class SparkGetColumnsOperation( } try { + // Tables and views db2Tabs.foreach { case (dbName, tables) => catalog.getTablesByName(tables).foreach { catalogTable => -catalogTable.schema.foreach { column => - if (columnPattern != null && !columnPattern.matcher(column.name).matches()) { - } else { -val rowData = Array[AnyRef]( - null, // TABLE_CAT - dbName, // TABLE_SCHEM - catalogTable.identifier.table, // TABLE_NAME - column.name, // COLUMN_NAME - toJavaSQLType(column.dataType.sql).asInstanceOf[AnyRef], // DATA_TYPE - column.dataType.sql, // TYPE_NAME - null, // COLUMN_SIZE - null, // 
BUFFER_LENGTH, unused - null, // DECIMAL_DIGITS - null, // NUM_PREC_RADIX - (if (column.nullable) 1 else 0).asInstanceOf[AnyRef], // NULLABLE - column.getComment().getOrElse(""), // REMARKS - null, // COLUMN_DEF - null, // SQL_DATA_TYPE - null, // SQL_DATETIME_SUB - null, // CHAR_OCTET_LENGTH - null, // ORDINAL_POSITION - "YES", // IS_NULLABLE - null, // SCOPE_CATALOG - null, // SCOPE_SCHEMA - null, // SCOPE_TABLE - null, // SOURCE_DATA_TYPE - "NO" // IS_AUTO_INCREMENT -) -rowSet.addRow(rowData) - } -} +addToRowSet(columnPattern, dbName, catalogTable.identifier.table
[spark] branch master updated: [SPARK-28196][SQL] Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 24e1e41 [SPARK-28196][SQL] Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog 24e1e41 is described below commit 24e1e41648de58d3437e008b187b84828830e238 Author: Yuming Wang AuthorDate: Sat Jun 29 18:36:36 2019 -0700 [SPARK-28196][SQL] Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog ## What changes were proposed in this pull request? This pr add two API for [SessionCatalog](https://github.com/apache/spark/blob/df4cb471c9712a2fe496664028d9303caebd8777/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala): ```scala def listTables(db: String, pattern: String, includeLocalTempViews: Boolean): Seq[TableIdentifier] def listLocalTempViews(pattern: String): Seq[TableIdentifier] ``` Because in some cases `listTables` does not need local temporary view and sometimes only need list local temporary view. ## How was this patch tested? unit tests Closes #24995 from wangyum/SPARK-28196. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../sql/catalyst/catalog/SessionCatalog.scala | 29 +- .../sql/catalyst/catalog/SessionCatalogSuite.scala | 65 ++ 2 files changed, 91 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala index dcc6298..74559f5 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala @@ -784,7 +784,19 @@ class SessionCatalog( * Note that, if the specified database is global temporary view database, we will list global * temporary views. */ - def listTables(db: String, pattern: String): Seq[TableIdentifier] = { + def listTables(db: String, pattern: String): Seq[TableIdentifier] = listTables(db, pattern, true) + + /** + * List all matching tables in the specified database, including local temporary views + * if includeLocalTempViews is enabled. + * + * Note that, if the specified database is global temporary view database, we will list global + * temporary views. + */ + def listTables( + db: String, + pattern: String, + includeLocalTempViews: Boolean): Seq[TableIdentifier] = { val dbName = formatDatabaseName(db) val dbTables = if (dbName == globalTempViewManager.database) { globalTempViewManager.listViewNames(pattern).map { name => @@ -796,12 +808,23 @@ class SessionCatalog( TableIdentifier(name, Some(dbName)) } } -val localTempViews = synchronized { + +if (includeLocalTempViews) { + dbTables ++ listLocalTempViews(pattern) +} else { + dbTables +} + } + + /** + * List all matching local temporary views. 
+ */ + def listLocalTempViews(pattern: String): Seq[TableIdentifier] = { +synchronized { StringUtils.filterPattern(tempViews.keys.toSeq, pattern).map { name => TableIdentifier(name) } } -dbTables ++ localTempViews } /** diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala index 5a9e4ad..bce8553 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala @@ -717,6 +717,71 @@ abstract class SessionCatalogSuite extends AnalysisTest { } } + test("list tables with pattern and includeLocalTempViews") { +withEmptyCatalog { catalog => + catalog.createDatabase(newDb("mydb"), ignoreIfExists = false) + catalog.createTable(newTable("tbl1", "mydb"), ignoreIfExists = false) + catalog.createTable(newTable("tbl2", "mydb"), ignoreIfExists = false) + val tempTable = Range(1, 10, 2, 10) + catalog.createTempView("temp_view1", tempTable, overrideIfExists = false) + catalog.createTempView("temp_view4", tempTable, overrideIfExists = false) + + assert(catalog.listTables("mydb").toSet == catalog.listTables("mydb", "*").toSet) + assert(catalog.listTables("mydb").toSet == catalog.listTables("mydb", "*", true).toSet) + assert(catalog.listTables("mydb").toSet == +catalog.listTables("mydb&q
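A hedged usage sketch of the two new calls from application code; reaching them through `spark.sessionState.catalog` relies on an unstable internal API, so treat this as illustration only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("catalog-demo").getOrCreate()
spark.range(5).createOrReplaceTempView("temp_view1")

val catalog = spark.sessionState.catalog

// Tables and permanent views in `default`, now excluding local temp views.
val persistentOnly = catalog.listTables("default", "*", includeLocalTempViews = false)

// Only this session's local temporary views.
val localTempViews = catalog.listLocalTempViews("*")

println(localTempViews.mkString(", ")) // e.g. `temp_view1`
```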
[spark] branch master updated (73183b3 -> e0e2144)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 73183b3 [SPARK-11412][SQL] Support merge schema for ORC add e0e2144 [SPARK-28184][SQL][TEST] Avoid creating new sessions in SparkMetadataOperationSuite No new revisions were added by this update. Summary of changes: .../thriftserver/SparkMetadataOperationSuite.scala | 271 ++--- 1 file changed, 68 insertions(+), 203 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-11412][SQL] Support merge schema for ORC
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 73183b3 [SPARK-11412][SQL] Support merge schema for ORC 73183b3 is described below commit 73183b3c8c2022846587f08e8dea5c387ed3b8d5 Author: wangguangxin.cn AuthorDate: Sat Jun 29 17:08:31 2019 -0700 [SPARK-11412][SQL] Support merge schema for ORC ## What changes were proposed in this pull request? Currently, ORC's `inferSchema` is implemented as randomly choosing one ORC file and reading its schema. This PR follows the behavior of Parquet, it implements merge schemas logic by reading all ORC files in parallel through a spark job. Users can enable merge schema by `spark.read.orc("xxx").option("mergeSchema", "true")` or by setting `spark.sql.orc.mergeSchema` to `true`, the prior one has higher priority. ## How was this patch tested? tested by UT OrcUtilsSuite.scala Closes #24043 from WangGuangxin/SPARK-11412. Lead-authored-by: wangguangxin.cn Co-authored-by: wangguangxin.cn Signed-off-by: gatorsmile --- .../org/apache/spark/sql/internal/SQLConf.scala| 8 ++ .../execution/datasources/SchemaMergeUtils.scala | 106 ++ .../execution/datasources/orc/OrcFileFormat.scala | 2 +- .../sql/execution/datasources/orc/OrcOptions.scala | 11 ++ .../sql/execution/datasources/orc/OrcUtils.scala | 26 +++- .../datasources/parquet/ParquetFileFormat.scala| 81 ++- .../execution/datasources/v2/orc/OrcTable.scala| 4 +- .../execution/datasources/ReadSchemaSuite.scala| 21 +++ .../execution/datasources/orc/OrcSourceSuite.scala | 154 - .../apache/spark/sql/hive/orc/OrcFileFormat.scala | 20 ++- .../spark/sql/hive/orc/OrcFileOperator.scala | 21 ++- .../spark/sql/hive/orc/HiveOrcSourceSuite.scala| 4 + 12 files changed, 375 insertions(+), 83 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index ff84f4e..e4c58ef 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -566,6 +566,12 @@ object SQLConf { .booleanConf .createWithDefault(true) + val ORC_SCHEMA_MERGING_ENABLED = buildConf("spark.sql.orc.mergeSchema") +.doc("When true, the Orc data source merges schemas collected from all data files, " + + "otherwise the schema is picked from a random data file.") +.booleanConf +.createWithDefault(false) + val HIVE_VERIFY_PARTITION_PATH = buildConf("spark.sql.hive.verifyPartitionPath") .doc("When true, check all the partition paths under the table\'s root directory " + "when reading data stored in HDFS. 
This configuration will be deprecated in the future " + @@ -1956,6 +1962,8 @@ class SQLConf extends Serializable with Logging { def orcFilterPushDown: Boolean = getConf(ORC_FILTER_PUSHDOWN_ENABLED) + def isOrcSchemaMergingEnabled: Boolean = getConf(ORC_SCHEMA_MERGING_ENABLED) + def verifyPartitionPath: Boolean = getConf(HIVE_VERIFY_PARTITION_PATH) def metastorePartitionPruning: Boolean = getConf(HIVE_METASTORE_PARTITION_PRUNING) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaMergeUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaMergeUtils.scala new file mode 100644 index 000..99882b0 --- /dev/null +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaMergeUtils.scala @@ -0,0 +1,106 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} + +import org.apache.spark.SparkException +import org.apach
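A usage sketch of the new ORC option and session conf described above; the paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("orc-merge-schema").getOrCreate()

// Per-read option (takes precedence over the session conf)...
val merged = spark.read.option("mergeSchema", "true").orc("/path/to/orc/table")

// ...or enable schema merging for all ORC reads in this session.
spark.conf.set("spark.sql.orc.mergeSchema", "true")
val mergedToo = spark.read.orc("/path/to/orc/table")

merged.printSchema()
```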
[spark] branch master updated: [SPARK-28056][.2][PYTHON][SQL] add docstring/doctest for SCALAR_ITER Pandas UDF
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8299600 [SPARK-28056][.2][PYTHON][SQL] add docstring/doctest for SCALAR_ITER Pandas UDF 8299600 is described below commit 8299600575024ca81127b7bf8ef48ae11fdd0594 Author: Xiangrui Meng AuthorDate: Fri Jun 28 15:09:57 2019 -0700 [SPARK-28056][.2][PYTHON][SQL] add docstring/doctest for SCALAR_ITER Pandas UDF ## What changes were proposed in this pull request? Add docstring/doctest for `SCALAR_ITER` Pandas UDF. I explicitly mentioned that per-partition execution is an implementation detail, not guaranteed. I will submit another PR to add the same to user guide, just to keep this PR minimal. I didn't add "doctest: +SKIP" in the first commit so it is easy to test locally. cc: HyukjinKwon gatorsmile icexelloss BryanCutler WeichenXu123 ![Screen Shot 2019-06-28 at 9 52 41 AM](https://user-images.githubusercontent.com/829644/60358349-b0aa5400-998a-11e9-9ebf-8481dfd555b5.png) ![Screen Shot 2019-06-28 at 9 53 19 AM](https://user-images.githubusercontent.com/829644/60358355-b1db8100-998a-11e9-8f6f-00a11bdbdc4d.png) ## How was this patch tested? doctest Closes #25005 from mengxr/SPARK-28056.2. Authored-by: Xiangrui Meng Signed-off-by: gatorsmile --- python/pyspark/sql/functions.py | 104 +++- 1 file changed, 102 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 34f6593..5d1e69e 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -2951,7 +2951,107 @@ def pandas_udf(f=None, returnType=None, functionType=None): Therefore, this can be used, for example, to ensure the length of each returned `pandas.Series`, and can not be used as the column length. -2. GROUPED_MAP +2. SCALAR_ITER + + A scalar iterator UDF is semantically the same as the scalar Pandas UDF above except that the + wrapped Python function takes an iterator of batches as input instead of a single batch and, + instead of returning a single output batch, it yields output batches or explicitly returns an + generator or an iterator of output batches. + It is useful when the UDF execution requires initializing some state, e.g., loading a machine + learning model file to apply inference to every input batch. + + .. note:: It is not guaranteed that one invocation of a scalar iterator UDF will process all + batches from one partition, although it is currently implemented this way. + Your code shall not rely on this behavior because it might change in the future for + further optimization, e.g., one invocation processes multiple partitions. + + Scalar iterator UDFs are used with :meth:`pyspark.sql.DataFrame.withColumn` and + :meth:`pyspark.sql.DataFrame.select`. + + >>> import pandas as pd # doctest: +SKIP + >>> from pyspark.sql.functions import col, pandas_udf, struct, PandasUDFType + >>> pdf = pd.DataFrame([1, 2, 3], columns=["x"]) # doctest: +SKIP + >>> df = spark.createDataFrame(pdf) # doctest: +SKIP + + When the UDF is called with a single column that is not `StructType`, the input to the + underlying function is an iterator of `pd.Series`. + + >>> @pandas_udf("long", PandasUDFType.SCALAR_ITER) # doctest: +SKIP + ... def plus_one(batch_iter): + ... for x in batch_iter: + ... yield x + 1 + ... 
+    >>> df.select(plus_one(col("x"))).show()  # doctest: +SKIP
+    +-----------+
+    |plus_one(x)|
+    +-----------+
+    |          2|
+    |          3|
+    |          4|
+    +-----------+
+
+    When the UDF is called with more than one columns, the input to the underlying function is an
+    iterator of `pd.Series` tuple.
+
+    >>> @pandas_udf("long", PandasUDFType.SCALAR_ITER)  # doctest: +SKIP
+    ... def multiply_two_cols(batch_iter):
+    ...     for a, b in batch_iter:
+    ...         yield a * b
+    ...
+    >>> df.select(multiply_two_cols(col("x"), col("x"))).show()  # doctest: +SKIP
+    +-----------------------+
+    |multiply_two_cols(x, x)|
+    +-----------------------+
+    |                      1|
+    |                      4|
+    |                      9|
+    +-----------------------+
+
+    When the UDF is called with a single column that is `StructType`, the input to the underlying
+    function is an iterator
[spark] branch master updated (a7e1619 -> cded421)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from a7e1619 [SPARK-28174][BUILD][SS] Upgrade to Kafka 2.3.0 add cded421 [SPARK-27871][SQL] LambdaVariable should use per-query unique IDs instead of globally unique IDs No new revisions were added by this update. Summary of changes: .../sql/catalyst/encoders/ExpressionEncoder.scala | 30 ++- .../sql/catalyst/expressions/objects/objects.scala | 268 + .../spark/sql/catalyst/optimizer/Optimizer.scala | 3 +- .../spark/sql/catalyst/optimizer/objects.scala | 53 +++- .../org/apache/spark/sql/internal/SQLConf.scala| 7 + .../expressions/ObjectExpressionsSuite.scala | 2 +- .../optimizer/ReassignLambdaVariableIDSuite.scala | 61 + .../main/scala/org/apache/spark/sql/Dataset.scala | 35 +-- .../aggregate/TypedAggregateExpression.scala | 23 +- .../spark/sql/DatasetOptimizationSuite.scala | 48 +++- .../scala/org/apache/spark/sql/DatasetSuite.scala | 17 +- .../sql/execution/WholeStageCodegenSuite.scala | 2 +- 12 files changed, 342 insertions(+), 207 deletions(-) create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReassignLambdaVariableIDSuite.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28059][SQL][TEST] Port int4.sql
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 929d313 [SPARK-28059][SQL][TEST] Port int4.sql 929d313 is described below commit 929d3135687226aa7b24edeeb0ff8e62cd087fae Author: Yuming Wang AuthorDate: Sat Jun 22 23:59:30 2019 -0700 [SPARK-28059][SQL][TEST] Port int4.sql ## What changes were proposed in this pull request? This PR is to port int4.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int4.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/int4.out When porting the test cases, found two PostgreSQL specific features that do not exist in Spark SQL: [SPARK-28023](https://issues.apache.org/jira/browse/SPARK-28023): Trim the string when cast string type to other types [SPARK-28027](https://issues.apache.org/jira/browse/SPARK-28027): Add bitwise shift left/right operators Also, found a bug: [SPARK-28024](https://issues.apache.org/jira/browse/SPARK-28024): Incorrect value when out of range Also, found four inconsistent behavior: [SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Invalid input syntax for integer: "34.5" at PostgreSQL [SPARK-28027](https://issues.apache.org/jira/browse/SPARK-28027) Our operator `!` and `!!` has different meanings [SPARK-28028](https://issues.apache.org/jira/browse/SPARK-28028): Cast numeric to integral type need round [SPARK-2659](https://issues.apache.org/jira/browse/SPARK-2659): HiveQL: Division operator should always perform fractional division, for example: ```sql select 1/2; ``` ## How was this patch tested? N/A Closes #24877 from wangyum/SPARK-28059. 
Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../test/resources/sql-tests/inputs/pgSQL/int4.sql | 178 +++ .../resources/sql-tests/results/pgSQL/int4.sql.out | 530 + 2 files changed, 708 insertions(+) diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/int4.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/int4.sql new file mode 100644 index 000..89cac00 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/int4.sql @@ -0,0 +1,178 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- INT4 +-- https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/int4.sql +-- + +CREATE TABLE INT4_TBL(f1 int) USING parquet; + +-- [SPARK-28023] Trim the string when cast string type to other types +INSERT INTO INT4_TBL VALUES (trim(' 0 ')); + +INSERT INTO INT4_TBL VALUES (trim('123456 ')); + +INSERT INTO INT4_TBL VALUES (trim('-123456')); + +-- [SPARK-27923] Invalid input syntax for integer: "34.5" at PostgreSQL +-- INSERT INTO INT4_TBL(f1) VALUES ('34.5'); + +-- largest and smallest values +INSERT INTO INT4_TBL VALUES ('2147483647'); + +INSERT INTO INT4_TBL VALUES ('-2147483647'); + +-- [SPARK-27923] Spark SQL insert these bad inputs to NULL +-- bad input values +-- INSERT INTO INT4_TBL(f1) VALUES ('1'); +-- INSERT INTO INT4_TBL(f1) VALUES ('asdf'); +-- INSERT INTO INT4_TBL(f1) VALUES (' '); +-- INSERT INTO INT4_TBL(f1) VALUES (' asdf '); +-- INSERT INTO INT4_TBL(f1) VALUES ('- 1234'); +-- INSERT INTO INT4_TBL(f1) VALUES ('123 5'); +-- INSERT INTO INT4_TBL(f1) VALUES (''); + + +SELECT '' AS five, * FROM INT4_TBL; + +SELECT '' AS four, i.* FROM INT4_TBL i WHERE i.f1 <> smallint('0'); + +SELECT '' AS four, i.* FROM INT4_TBL i WHERE i.f1 <> int('0'); + +SELECT '' AS one, i.* FROM INT4_TBL i WHERE i.f1 = smallint('0'); + +SELECT '' AS one, i.* FROM INT4_TBL i WHERE i.f1 = int('0'); + +SELECT '' AS two, i.* FROM INT4_TBL i WHERE i.f1 < smallint('0'); + +SELECT '' AS two, i.* FROM INT4_TBL i WHERE i.f1 < int('0'); + +SELECT '' AS three, i.* FROM INT4_TBL i WHERE i.f1 <= smallint('0'); + +SELECT '' AS three, i.* FROM INT4_TBL i WHERE i.f1 <= int('0'); + +SELECT '' AS two, i.* FROM INT4_TBL i WHERE i.f1 > smallint('0'); + +SELECT '' AS two, i.* FROM INT4_TBL i WHERE i.f1 > int('0'); + +SELECT '' AS three, i.* FROM INT4_TBL i WHERE i.f1 >= smallint('0'); + +SELECT '' AS three, i.* FROM INT4_TBL i WHERE i.f1 >= int('0'); + +-- positive odds
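The division behavior referenced by SPARK-2659 above is easy to check from a Spark shell. A minimal sketch, assuming a running `SparkSession` named `spark` (the aliases and result comments are illustrative):

```scala
// In Spark SQL, `/` always performs fractional division, even on two integer
// literals, while `div` yields the integral quotient.
spark.sql("SELECT 1 / 2 AS fractional, 1 DIV 2 AS integral").show()
// fractional = 0.5, integral = 0
```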
[spark] branch master updated (870f972 -> 0768fad)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 870f972 [SPARK-28104][SQL] Implement Spark's own GetColumnsOperation add 0768fad [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax No new revisions were added by this update. Summary of changes: .../antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 | 2 +- .../spark/sql/catalyst/expressions/stringExpressions.scala | 12 +--- .../org/apache/spark/sql/catalyst/parser/AstBuilder.scala| 2 +- .../apache/spark/sql/catalyst/parser/PlanParserSuite.scala | 5 + .../src/test/resources/sql-tests/inputs/string-functions.sql | 4 ++-- .../resources/sql-tests/results/string-functions.sql.out | 12 ++-- 6 files changed, 24 insertions(+), 13 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
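For readers tracking the syntax change, a small sketch of the newly supported form, assuming a `SparkSession` named `spark` (literals are illustrative):

```scala
// TRIM(trimStr FROM str) strips any character contained in trimStr from both
// ends of str.
spark.sql("SELECT TRIM('xy' FROM 'yxSpark SQLxxy') AS trimmed").show()
// trimmed = "Spark SQL"
```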
[spark] branch master updated: [SPARK-28104][SQL] Implement Spark's own GetColumnsOperation
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 870f972 [SPARK-28104][SQL] Implement Spark's own GetColumnsOperation 870f972 is described below commit 870f972dcc7b2b0e5bea2ae64f2c9598c681eddf Author: Yuming Wang AuthorDate: Sat Jun 22 09:15:07 2019 -0700 [SPARK-28104][SQL] Implement Spark's own GetColumnsOperation ## What changes were proposed in this pull request? [SPARK-24196](https://issues.apache.org/jira/browse/SPARK-24196) and [SPARK-24570](https://issues.apache.org/jira/browse/SPARK-24570) implemented Spark's own `GetSchemasOperation` and `GetTablesOperation`. This pr implements Spark's own `GetColumnsOperation`. ## How was this patch tested? unit tests and manual tests: ![image](https://user-images.githubusercontent.com/5399861/59745367-3a7d6180-92a7-11e9-862d-96bc494c5f00.png) Closes #24906 from wangyum/SPARK-28104. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../thriftserver/SparkGetColumnsOperation.scala| 142 + .../server/SparkSQLOperationManager.scala | 20 ++- .../thriftserver/HiveThriftServer2Suites.scala | 9 +- .../thriftserver/SparkMetadataOperationSuite.scala | 92 - .../service/cli/operation/GetColumnsOperation.java | 4 +- .../hive/thriftserver/ThriftserverShimUtils.scala | 5 +- .../service/cli/operation/GetColumnsOperation.java | 4 +- .../hive/thriftserver/ThriftserverShimUtils.scala | 4 + 8 files changed, 270 insertions(+), 10 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala new file mode 100644 index 000..4b78e2f --- /dev/null +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hive.thriftserver + +import java.util.UUID +import java.util.regex.Pattern + +import scala.collection.JavaConverters.seqAsJavaListConverter + +import org.apache.hadoop.hive.ql.security.authorization.plugin.{HiveOperationType, HivePrivilegeObject} +import org.apache.hadoop.hive.ql.security.authorization.plugin.HivePrivilegeObject.HivePrivilegeObjectType +import org.apache.hive.service.cli._ +import org.apache.hive.service.cli.operation.GetColumnsOperation +import org.apache.hive.service.cli.session.HiveSession + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.catalog.SessionCatalog +import org.apache.spark.sql.hive.thriftserver.ThriftserverShimUtils.toJavaSQLType + +/** + * Spark's own SparkGetColumnsOperation + * + * @param sqlContext SQLContext to use + * @param parentSession a HiveSession from SessionManager + * @param catalogName catalog name. NULL if not applicable. + * @param schemaName database name, NULL or a concrete database name + * @param tableName table name + * @param columnName column name + */ +private[hive] class SparkGetColumnsOperation( +sqlContext: SQLContext, +parentSession: HiveSession, +catalogName: String, +schemaName: String, +tableName: String, +columnName: String) + extends GetColumnsOperation(parentSession, catalogName, schemaName, tableName, columnName) +with Logging { + + val catalog: SessionCatalog = sqlContext.sessionState.catalog + + private var statementId: String = _ + + override def runInternal(): Unit = { +val cmdStr = s"catalog : $catalogName, schemaPattern : $schemaName, tablePattern : $tableName" + + s", columnName : $columnName" +logInfo(s"GetColumnsOperation: $cmdStr") + +setState(OperationState.RUNNING) +// Always use the latest class
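A hedged sketch of the client path this operation serves: a plain JDBC program asking the Spark Thrift Server for column metadata. The connection URL, credentials, and name patterns are assumptions, and the Hive JDBC driver must be on the classpath:

```scala
import java.sql.DriverManager

// Hypothetical client: exercises the Thrift Server's GetColumns path through
// standard JDBC DatabaseMetaData calls.
object GetColumnsSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    try {
      // (catalog, schemaPattern, tableNamePattern, columnNamePattern)
      val rs = conn.getMetaData.getColumns(null, "default", "%", "%")
      while (rs.next()) {
        println(s"${rs.getString("TABLE_NAME")}.${rs.getString("COLUMN_NAME")}: " +
          rs.getString("TYPE_NAME"))
      }
    } finally {
      conn.close()
    }
  }
}
```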
[spark] branch master updated (47f54b1 -> 54da3bb)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 47f54b1 [SPARK-28118][CORE] Add `spark.eventLog.compression.codec` configuration add 54da3bb [SPARK-28127][SQL] Micro optimization on TreeNode's mapChildren method No new revisions were added by this update. Summary of changes: .../src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-27890][SQL] Improve SQL parser error message for character-only identifier with hyphens except those in expressions
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7b7f16f [SPARK-27890][SQL] Improve SQL parser error message for character-only identifier with hyphens except those in expressions 7b7f16f is described below commit 7b7f16f2a7a6a6685a8917a9b5ba403fff76 Author: Yesheng Ma AuthorDate: Tue Jun 18 21:51:15 2019 -0700 [SPARK-27890][SQL] Improve SQL parser error message for character-only identifier with hyphens except those in expressions ## What changes were proposed in this pull request? Current SQL parser's error message for hyphen-connected identifiers without surrounding backquotes(e.g. hyphen-table) is confusing for end users. A possible approach to tackle this is to explicitly capture these wrong usages in the SQL parser. In this way, the end users can fix these errors more quickly. For example, for a simple query such as `SELECT * FROM test-table`, the original error message is ``` Error in SQL statement: ParseException: mismatched input '-' expecting (line 1, pos 18) ``` which can be confusing in a large query. After the fix, the error message is: ``` Error in query: Possibly unquoted identifier test-table detected. Please consider quoting it with back-quotes as `test-table`(line 1, pos 14) == SQL == SELECT * FROM test-table --^^^ ``` which is easier for end users to identify the issue and fix. We safely augmented the current grammar rule to explicitly capture these error cases. The error handling logic is implemented in the SQL parsing listener `PostProcessor`. However, note that for cases such as `a - my-func(b)`, the parser can't actually tell whether this should be ``a -`my-func`(b) `` or `a - my - func(b)`. Therefore for these cases, we leave the parser as is. Also, in this patch we only provide better error messages for character-only identifiers. ## How was this patch tested? Adding new unit tests. Closes #24749 from yeshengm/hyphen-ident. Authored-by: Yesheng Ma Signed-off-by: gatorsmile --- .../apache/spark/sql/catalyst/parser/SqlBase.g4| 60 ++- .../spark/sql/catalyst/parser/AstBuilder.scala | 16 +-- .../spark/sql/catalyst/parser/ParseDriver.scala| 8 ++ .../sql/catalyst/parser/ErrorParserSuite.scala | 110 + .../spark/sql/execution/SparkSqlParser.scala | 10 +- 5 files changed, 169 insertions(+), 35 deletions(-) diff --git a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 index dcb7939..f57a659 100644 --- a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 +++ b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 @@ -82,13 +82,15 @@ singleTableSchema statement : query #statementDefault | ctes? dmlStatementNoWith #dmlStatement -| USE db=identifier#use -| CREATE database (IF NOT EXISTS)? identifier +| USE db=errorCapturingIdentifier #use +| CREATE database (IF NOT EXISTS)? db=errorCapturingIdentifier ((COMMENT comment=STRING) | locationSpec | (WITH DBPROPERTIES tablePropertyList))* #createDatabase -| ALTER database identifier SET DBPROPERTIES tablePropertyList #setDatabaseProperties -| DROP database (IF EXISTS)? identifier (RESTRICT | CASCADE)? #dropDatabase +| ALTER database db=errorCapturingIdentifier +SET DBPROPERTIES tablePropertyList #setDatabaseProperties +| DROP database (IF EXISTS)? 
db=errorCapturingIdentifier +(RESTRICT | CASCADE)? #dropDatabase | SHOW DATABASES (LIKE? pattern=STRING)? #showDatabases | createTableHeader ('(' colTypeList ')')? tableProvider ((OPTIONS options=tablePropertyList) | @@ -135,7 +137,8 @@ statement (ALTER | CHANGE) COLUMN? qualifiedName (TYPE dataType)? (COMMENT comment=STRING)? colPosition? #alterTableColumn | ALTER TABLE tableIdentifier partitionSpec? -CHANGE COLUMN? identifier colType colPosition? #changeColumn +CHANGE COLUMN? +colName=errorCapturingIdentifier colType colPosition? #changeColumn | ALTER TABLE tableIdentifier (partitionSpec)? SET SERDE STRING (WITH SERDEPROPERTIES tablePropertyList)? #setTable
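The user-facing takeaway, sketched with an illustrative table name and an assumed `SparkSession` named `spark`:

```scala
// An unquoted hyphenated name now fails with the clearer ParseException shown above:
// spark.sql("SELECT * FROM test-table")

// Quoting with back-quotes, as the new message suggests, parses and resolves as
// long as a table named `test-table` actually exists.
spark.sql("SELECT * FROM `test-table`").show()
```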
[spark] branch master updated (a5dcb82 -> 15de6d0)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from a5dcb82 [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion add 15de6d0 [SPARK-28096][SQL] Convert defs to lazy vals to avoid expensive reference computation in QueryPlan and Expression No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/expressions/Expression.scala | 9 - .../catalyst/expressions/aggregate/interfaces.scala | 3 ++- .../spark/sql/catalyst/expressions/grouping.scala | 8 ++-- .../sql/catalyst/expressions/namedExpressions.scala | 3 ++- .../apache/spark/sql/catalyst/plans/QueryPlan.scala | 7 +-- .../sql/catalyst/plans/logical/LogicalPlan.scala | 2 +- .../catalyst/plans/logical/QueryPlanConstraints.scala | 2 +- .../catalyst/plans/logical/ScriptTransformation.scala | 3 ++- .../plans/logical/basicLogicalOperators.scala | 19 ++- .../spark/sql/catalyst/plans/logical/object.scala | 9 ++--- .../plans/logical/pythonLogicalOperators.scala| 3 ++- .../org/apache/spark/sql/execution/ExpandExec.scala | 3 ++- .../org/apache/spark/sql/execution/objects.scala | 3 ++- .../spark/sql/execution/python/EvalPythonExec.scala | 3 ++- 14 files changed, 51 insertions(+), 26 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
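As background on why the conversion helps, a generic illustration (not the Spark code itself) of how a `def` differs from a `lazy val` on an immutable tree node; the class and member names are invented:

```scala
// A `def` recomputes its body on every access; a `lazy val` computes it once per
// instance and caches the result, which is safe here because the tree is immutable.
class Node(name: String, children: Seq[Node]) {
  // Re-walks the whole subtree every time it is called.
  def referencesAsDef: Set[String] =
    children.flatMap(_.referencesAsDef).toSet + name

  // Walks the subtree at most once per Node, then reuses the cached set.
  lazy val referencesAsLazyVal: Set[String] =
    children.flatMap(_.referencesAsLazyVal).toSet + name
}
```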
[spark] branch master updated: [SPARK-28039][SQL][TEST] Port float4.sql
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2e3ae97 [SPARK-28039][SQL][TEST] Port float4.sql 2e3ae97 is described below commit 2e3ae97668f9170c820ec5564edc50dff8347915 Author: Yuming Wang AuthorDate: Tue Jun 18 16:22:30 2019 -0700 [SPARK-28039][SQL][TEST] Port float4.sql ## What changes were proposed in this pull request? This PR is to port float4.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/float4.out When porting the test cases, found three PostgreSQL specific features that do not exist in Spark SQL: [SPARK-28060](https://issues.apache.org/jira/browse/SPARK-28060): Float type can not accept some special inputs [SPARK-28027](https://issues.apache.org/jira/browse/SPARK-28027): Spark SQL does not support prefix operator `` [SPARK-28061](https://issues.apache.org/jira/browse/SPARK-28061): Support for converting float to binary format Also, found a bug: [SPARK-28024](https://issues.apache.org/jira/browse/SPARK-28024): Incorrect value when out of range Also, found three inconsistent behavior: [SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL insert there bad inputs to NULL [SPARK-28028](https://issues.apache.org/jira/browse/SPARK-28028): Cast numeric to integral type need round [SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL returns NULL when dividing by zero ## How was this patch tested? N/A Closes #24887 from wangyum/SPARK-28039. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../resources/sql-tests/inputs/pgSQL/float4.sql| 363 .../sql-tests/results/pgSQL/float4.sql.out | 379 + 2 files changed, 742 insertions(+) diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float4.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float4.sql new file mode 100644 index 000..9e684d1 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float4.sql @@ -0,0 +1,363 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- FLOAT4 +-- https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql + +CREATE TABLE FLOAT4_TBL (f1 float) USING parquet; + +INSERT INTO FLOAT4_TBL VALUES ('0.0'); +INSERT INTO FLOAT4_TBL VALUES ('1004.30 '); +INSERT INTO FLOAT4_TBL VALUES (' -34.84'); +INSERT INTO FLOAT4_TBL VALUES ('1.2345678901234e+20'); +INSERT INTO FLOAT4_TBL VALUES ('1.2345678901234e-20'); + +-- [SPARK-28024] Incorrect numeric values when out of range +-- test for over and under flow +-- INSERT INTO FLOAT4_TBL VALUES ('10e70'); +-- INSERT INTO FLOAT4_TBL VALUES ('-10e70'); +-- INSERT INTO FLOAT4_TBL VALUES ('10e-70'); +-- INSERT INTO FLOAT4_TBL VALUES ('-10e-70'); + +-- INSERT INTO FLOAT4_TBL VALUES ('10e400'); +-- INSERT INTO FLOAT4_TBL VALUES ('-10e400'); +-- INSERT INTO FLOAT4_TBL VALUES ('10e-400'); +-- INSERT INTO FLOAT4_TBL VALUES ('-10e-400'); + +-- [SPARK-27923] Spark SQL insert there bad inputs to NULL +-- bad input +-- INSERT INTO FLOAT4_TBL VALUES (''); +-- INSERT INTO FLOAT4_TBL VALUES (' '); +-- INSERT INTO FLOAT4_TBL VALUES ('xyz'); +-- INSERT INTO FLOAT4_TBL VALUES ('5.0.0'); +-- INSERT INTO FLOAT4_TBL VALUES ('5 . 
0'); +-- INSERT INTO FLOAT4_TBL VALUES ('5. 0'); +-- INSERT INTO FLOAT4_TBL VALUES (' - 3.0'); +-- INSERT INTO FLOAT4_TBL VALUES ('1235'); + +-- special inputs +SELECT float('NaN'); +-- [SPARK-28060] Float type can not accept some special inputs +SELECT float('nan'); +SELECT float(' NAN '); +SELECT float('infinity'); +SELECT float(' -INFINiTY '); +-- [SPARK-27923] Spark SQL insert there bad special inputs to NULL +-- bad special inputs +SELECT float('N A N'); +SELECT float('NaN x'); +SELECT float(' INFINITYx'); + +-- [SPARK-28060] Float type can not accept some special inputs +SELECT float('Infinity') + 100.0; +SELECT float('Infinity') / float('Infinity'); +SELECT float('nan') / float('nan'); +SELECT float(decimal('nan')); + +SELECT '' AS five, * FROM FLOAT4_TBL; + +SELECT '' AS four, f.* FROM FLOAT4_TBL f WHERE f.f1 <
[spark] branch master updated (ed280c2 -> 1ada36b)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from ed280c2 [SPARK-28072][SQL] Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+ add 1ada36b [SPARK-27783][SQL] Add customizable hint error handler No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/analysis/Analyzer.scala | 2 +- .../sql/catalyst/analysis/HintErrorLogger.scala| 55 ++ .../spark/sql/catalyst/analysis/ResolveHints.scala | 22 - .../catalyst/optimizer/EliminateResolvedHint.scala | 16 +++ .../spark/sql/catalyst/plans/logical/hints.scala | 43 +++-- .../org/apache/spark/sql/internal/SQLConf.scala| 8 +++- .../scala/org/apache/spark/sql/JoinHintSuite.scala | 4 +- 7 files changed, 120 insertions(+), 30 deletions(-) create mode 100644 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HintErrorLogger.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
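For context, a rough sketch of the situation the hint error handler covers, assuming a `SparkSession` named `spark` and illustrative names: a hint naming a relation that does not appear in the query cannot be applied, and with the default logging handler the query still runs while the problem is reported.

```scala
spark.range(10).createOrReplaceTempView("t")
// "no_such_table" is not a relation in this query, so the BROADCAST hint cannot
// be applied; the default handler reports the unapplied hint rather than failing,
// and the query itself still returns its rows.
spark.sql("SELECT /*+ BROADCAST(no_such_table) */ * FROM t").show()
```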
[spark] branch master updated (63e0711 -> 6284ac7)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 63e0711 [SPARK-27899][SQL] Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API add 6284ac7 [SPARK-27934][SQL][TEST] Port case.sql No new revisions were added by this update. Summary of changes: .../test/resources/sql-tests/inputs/pgSQL/case.sql | 267 + .../resources/sql-tests/results/pgSQL/case.sql.out | 425 + .../org/apache/spark/sql/SQLQueryTestSuite.scala | 2 + 3 files changed, 694 insertions(+) create mode 100644 sql/core/src/test/resources/sql-tests/inputs/pgSQL/case.sql create mode 100644 sql/core/src/test/resources/sql-tests/results/pgSQL/case.sql.out - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-27899][SQL] Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 63e0711 [SPARK-27899][SQL] Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API 63e0711 is described below commit 63e071152498ccaf10f038231ae7e1706b786a0b Author: LantaoJin AuthorDate: Tue Jun 11 15:32:59 2019 +0800 [SPARK-27899][SQL] Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API ## What changes were proposed in this pull request? The new Spark ThriftServer SparkGetTablesOperation implemented in https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore). Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog. If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation. ## How was this patch tested? Add UT Closes #24774 from LantaoJin/SPARK-27899. Authored-by: LantaoJin Signed-off-by: gatorsmile --- .../sql/catalyst/catalog/ExternalCatalog.scala | 2 + .../catalog/ExternalCatalogWithListener.scala | 4 + .../sql/catalyst/catalog/InMemoryCatalog.scala | 5 ++ .../sql/catalyst/catalog/SessionCatalog.scala | 28 +++ .../catalyst/catalog/ExternalCatalogSuite.scala| 22 ++ .../sql/catalyst/catalog/SessionCatalogSuite.scala | 90 ++ .../thriftserver/SparkGetTablesOperation.scala | 3 +- .../spark/sql/hive/HiveExternalCatalog.scala | 8 ++ .../apache/spark/sql/hive/client/HiveClient.scala | 3 + .../spark/sql/hive/client/HiveClientImpl.scala | 65 +++- .../apache/spark/sql/hive/client/HiveShim.scala| 14 .../spark/sql/hive/client/VersionsSuite.scala | 27 +++ 12 files changed, 265 insertions(+), 6 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala index 1a145c2..dcc1439 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala @@ -128,6 +128,8 @@ trait ExternalCatalog { def getTable(db: String, table: String): CatalogTable + def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable] + def tableExists(db: String, table: String): Boolean def listTables(db: String): Seq[String] diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener.scala index 2f009be..86113d3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener.scala @@ -138,6 +138,10 @@ class ExternalCatalogWithListener(delegate: ExternalCatalog) delegate.getTable(db, table) } + override def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable] = { +delegate.getTablesByName(db, tables) + } + 
override def tableExists(db: String, table: String): Boolean = { delegate.tableExists(db, table) } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala index 741dc46..abf6993 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala @@ -327,6 +327,11 @@ class InMemoryCatalog( catalog(db).tables(table).table } + override def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable] = { +requireDbExists(db) +tables.flatMap(catalog(db).tables.get).map(_.table) + } + override def tableExists(db: String, table: String): Boolean = synchronized { requireDbExists(db) catalog(db).tables.contains(table) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala index c05f777..e49e54f 100644 --- a/sql/cat
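A hedged sketch of the bulk lookup this adds, based on the `getTablesByName(db, tables)` signature visible in the diff; `sharedState.externalCatalog` is internal API and the table names are illustrative:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Fetch metadata for several tables in one call instead of one getTable round
// trip per table.
val tables: Seq[CatalogTable] =
  spark.sharedState.externalCatalog.getTablesByName("default", Seq("web_logs", "clickstream"))
tables.foreach(t => println(s"${t.identifier} -> ${t.schema.simpleString}"))
```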
[spark] branch master updated: [SPARK-27970][SQL] Support Hive 3.0 metastore
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2926890 [SPARK-27970][SQL] Support Hive 3.0 metastore 2926890 is described below commit 2926890ffbcf3a92c7e0863c69e31c3d22191112 Author: Yuming Wang AuthorDate: Fri Jun 7 15:24:07 2019 -0700 [SPARK-27970][SQL] Support Hive 3.0 metastore ## What changes were proposed in this pull request? It seems that some users are using Hive 3.0.0. This pr makes it support Hive 3.0 metastore. ## How was this patch tested? unit tests Closes #24688 from wangyum/SPARK-26145. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- docs/sql-data-sources-hive-tables.md | 2 +- docs/sql-migration-guide-hive-compatibility.md| 2 +- .../main/scala/org/apache/spark/sql/hive/HiveUtils.scala | 2 +- .../org/apache/spark/sql/hive/client/HiveClientImpl.scala | 3 ++- .../scala/org/apache/spark/sql/hive/client/HiveShim.scala | 4 +++- .../spark/sql/hive/client/IsolatedClientLoader.scala | 1 + .../scala/org/apache/spark/sql/hive/client/package.scala | 13 - .../apache/spark/sql/hive/execution/SaveAsHiveFile.scala | 2 +- .../apache/spark/sql/hive/client/HiveClientVersions.scala | 3 ++- .../apache/spark/sql/hive/client/HiveVersionSuite.scala | 4 ++-- .../org/apache/spark/sql/hive/client/VersionsSuite.scala | 15 +++ 11 files changed, 33 insertions(+), 18 deletions(-) diff --git a/docs/sql-data-sources-hive-tables.md b/docs/sql-data-sources-hive-tables.md index 3d58e94..5688011 100644 --- a/docs/sql-data-sources-hive-tables.md +++ b/docs/sql-data-sources-hive-tables.md @@ -130,7 +130,7 @@ The following options can be used to configure the version of Hive that is used 1.2.1 Version of the Hive metastore. Available - options are 0.12.0 through 2.3.5 and 3.1.0 through 3.1.1. + options are 0.12.0 through 2.3.5 and 3.0.0 through 3.1.1. diff --git a/docs/sql-migration-guide-hive-compatibility.md b/docs/sql-migration-guide-hive-compatibility.md index 4a8076d..f955e31 100644 --- a/docs/sql-migration-guide-hive-compatibility.md +++ b/docs/sql-migration-guide-hive-compatibility.md @@ -25,7 +25,7 @@ license: | Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently, Hive SerDes and UDFs are based on Hive 1.2.1, and Spark SQL can be connected to different versions of Hive Metastore -(from 0.12.0 to 2.3.5 and 3.1.0 to 3.1.1. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)). +(from 0.12.0 to 2.3.5 and 3.0.0 to 3.1.1. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)). Deploying in Existing Hive Warehouses diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala index 38ad061..c3ae3d5 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala @@ -64,7 +64,7 @@ private[spark] object HiveUtils extends Logging { val HIVE_METASTORE_VERSION = buildConf("spark.sql.hive.metastore.version") .doc("Version of the Hive metastore. 
Available options are " + "0.12.0 through 2.3.5 and " + -"3.1.0 through 3.1.1.") +"3.0.0 through 3.1.1.") .stringConf .createWithDefault(builtinHiveVersion) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala index b8d5f21..2b80165 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala @@ -107,6 +107,7 @@ private[hive] class HiveClientImpl( case hive.v2_1 => new Shim_v2_1() case hive.v2_2 => new Shim_v2_2() case hive.v2_3 => new Shim_v2_3() +case hive.v3_0 => new Shim_v3_0() case hive.v3_1 => new Shim_v3_1() } @@ -744,7 +745,7 @@ private[hive] class HiveClientImpl( // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type changed // and the CommandProcessorFactory.clean function removed. driver.getClass.getMethod("close").invoke(driver) - if (version != hive.v3_1) { + if (version != hive.v3_0 && version != hive.v3_1) { CommandProcessorFactory.clean(conf) } } diff --git a/sql/hive/src/main/sca
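A minimal configuration sketch for pointing Spark at a Hive 3.0 metastore once this lands; the application name and the choice of `maven` for `spark.sql.hive.metastore.jars` are assumptions rather than part of the commit:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-3.0-metastore-sketch")
  .config("spark.sql.hive.metastore.version", "3.0.0")
  .config("spark.sql.hive.metastore.jars", "maven") // or a curated classpath in production
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```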
[spark] branch master updated: [SPARK-27870][SQL][PYSPARK] Flush batch timely for pandas UDF (for improving pandas UDFs pipeline)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9c4eb99 [SPARK-27870][SQL][PYSPARK] Flush batch timely for pandas UDF (for improving pandas UDFs pipeline) 9c4eb99 is described below commit 9c4eb99c52803f2488ac3787672aa8d3e4d1544e Author: WeichenXu AuthorDate: Fri Jun 7 14:02:43 2019 -0700 [SPARK-27870][SQL][PYSPARK] Flush batch timely for pandas UDF (for improving pandas UDFs pipeline) ## What changes were proposed in this pull request? Flush batches in a timely manner for pandas UDFs. This can improve performance when multiple pandas UDF plans are pipelined: when each batch is flushed promptly, downstream pandas UDFs are pipelined as soon as possible, and the pipeline helps hide the downstream UDFs' computation time. For example, when the first UDF starts computing on batch-3, the second pipelined UDF can start computing on batch-2, and the third pipelined UDF can start computing on batch-1. If batches are not flushed in time, the downstream UDF pipeline lags too far behind, which may increase the total processing time. I add a flush in two places: * The JVM process feeds data into the Python worker: on the JVM side, flush after writing each batch. * The JVM process reads data from the Python worker output: on the Python worker side, flush after writing each batch. Without flushing, the default buffer size for both is 65536. Especially in the ML case, where the batch size is made very small for real-time prediction, that buffer size is far too large and causes the downstream pandas UDF pipeline to lag too far behind. ### Note * This only applies to pandas scalar UDFs. * Batches are not flushed individually; the minimum interval between two flushes is 0.1 second, which avoids flushing too frequently when the batch size is small. It works like:
```
last_flush_time = time.time()
for batch in iterator:
    writer.write_batch(batch)
    flush_time = time.time()
    if self.flush_timely and (flush_time - last_flush_time > 0.1):
        stream.flush()
        last_flush_time = flush_time
```
## How was this patch tested? ### Benchmark to make sure the flush does not cause a performance regression Test code:
```
numRows = ...
batchSize = ...
spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', str(batchSize))
df = spark.range(1, numRows + 1, numPartitions=1).select(col('id').alias('a'))

@pandas_udf("int", PandasUDFType.SCALAR)
def fp1(x):
    return x + 10

beg_time = time.time()
result = df.select(sum(fp1('a'))).head()
print("result: " + str(result[0]))
print("consume time: " + str(time.time() - beg_time))
```
Test Result:

params | Consume time (Before) | Consume time (After)
--- | --- | ---
numRows=1, batchSize=1 | 23.43s | 24.64s
numRows=1, batchSize=1000 | 36.73s | 34.50s
numRows=1000, batchSize=100 | 35.67s | 32.64s
numRows=100, batchSize=10 | 33.60s | 32.11s
numRows=10, batchSize=1 | 33.36s | 31.82s

### Benchmark pipelined pandas UDF Test code:
```
spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '1')
df = spark.range(1, 31, numPartitions=1).select(col('id').alias('a'))

@pandas_udf("int", PandasUDFType.SCALAR)
def fp1(x):
    print("run fp1")
    time.sleep(1)
    return x + 100

@pandas_udf("int", PandasUDFType.SCALAR)
def fp2(x, y):
    print("run fp2")
    time.sleep(1)
    return x + y

beg_time = time.time()
result = df.select(sum(fp2(fp1('a'), col('a')))).head()
print("result: " + str(result[0]))
print("consume time: " + str(time.time() - beg_time))
```
Test Result: **Before**: consume time: 63.57s **After**: consume time: 32.43s **So the PR improves performance by letting the downstream UDFs get pipelined early.** Please review https://spark.apache.org/contributing.html before opening a pull request. Closes #24734 from WeichenXu123/improve_pandas_udf_pipeline. Lead-authored-by: WeichenXu Co-authored-by: Xiangrui Meng Signed-off-by: gatorsmile --- python/pyspark/serializers.py | 18 -- python/pyspark/testing/utils.py | 3 +++ python/pyspark/tests/te
[spark] branch master updated (eee3467 -> b30655b)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from eee3467 [SPARK-27938][SQL] Remove feature flag LEGACY_PASS_PARTITION_BY_AS_OPTIONS add b30655b [SPARK-27965][SQL] Add extractors for v2 catalog transforms. No new revisions were added by this update. Summary of changes: .../sql/catalog/v2/expressions/expressions.scala | 83 +++ .../v2/expressions/TransformExtractorSuite.scala | 156 + 2 files changed, 239 insertions(+) create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalog/v2/expressions/TransformExtractorSuite.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-27918][SQL] Port boolean.sql
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new eadb538 [SPARK-27918][SQL] Port boolean.sql eadb538 is described below commit eadb53824d08480131498c7eb5bd7674f48b62c7 Author: Yuming Wang AuthorDate: Thu Jun 6 10:57:10 2019 -0700 [SPARK-27918][SQL] Port boolean.sql ## What changes were proposed in this pull request? This PR is to port boolean.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out When porting the test cases, found two PostgreSQL specific features that do not exist in Spark SQL: - [SPARK-27931](https://issues.apache.org/jira/browse/SPARK-27931): Accept 'on' and 'off' as input for boolean data type / Trim the string when cast to boolean type / Accept unique prefixes thereof - [SPARK-27924](https://issues.apache.org/jira/browse/SPARK-27924): Support E061-14: Search Conditions Also, found an inconsistent behavior: - [SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Unsupported input throws an exception in PostgreSQL but Spark accepts it and sets the value to `NULL`, for example: ```sql SELECT bool 'test' AS error; -- SELECT boolean('test') AS error; ``` ## How was this patch tested? N/A Closes #24767 from wangyum/SPARK-27918. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../resources/sql-tests/inputs/pgSQL/boolean.sql | 284 .../sql-tests/results/pgSQL/boolean.sql.out| 741 + .../org/apache/spark/sql/SQLQueryTestSuite.scala | 10 + 3 files changed, 1035 insertions(+) diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/boolean.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/boolean.sql new file mode 100644 index 000..8ba6f97 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/boolean.sql @@ -0,0 +1,284 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- BOOLEAN +-- https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql + +-- +-- sanity check - if this fails go insane! 
+-- +SELECT 1 AS one; + + +-- **testing built-in type bool + +-- check bool input syntax + +SELECT true AS true; + +SELECT false AS false; + +SELECT boolean('t') AS true; + +-- [SPARK-27931] Trim the string when cast string type to boolean type +SELECT boolean(' f ') AS false; + +SELECT boolean('true') AS true; + +-- [SPARK-27923] PostgreSQL does not accept 'test' but Spark SQL accepts it and sets it to NULL +SELECT boolean('test') AS error; + +SELECT boolean('false') AS false; + +-- [SPARK-27923] PostgreSQL does not accept 'foo' but Spark SQL accepts it and sets it to NULL +SELECT boolean('foo') AS error; + +SELECT boolean('y') AS true; + +SELECT boolean('yes') AS true; + +-- [SPARK-27923] PostgreSQL does not accept 'yeah' but Spark SQL accepts it and sets it to NULL +SELECT boolean('yeah') AS error; + +SELECT boolean('n') AS false; + +SELECT boolean('no') AS false; + +-- [SPARK-27923] PostgreSQL does not accept 'nay' but Spark SQL accepts it and sets it to NULL +SELECT boolean('nay') AS error; + +-- [SPARK-27931] Accept 'on' and 'off' as input for boolean data type +SELECT boolean('on') AS true; + +SELECT boolean('off') AS false; + +-- [SPARK-27931] Accept unique prefixes thereof +SELECT boolean('of') AS false; + +-- [SPARK-27923] PostgreSQL does not accept 'o' but Spark SQL accepts it and sets it to NULL +SELECT boolean('o') AS error; + +-- [SPARK-27923] PostgreSQL does not accept 'on_' but Spark SQL accepts it and sets it to NULL +SELECT boolean('on_') AS error; + +-- [SPARK-27923] PostgreSQL does not accept 'off_' but Spark SQL accepts it and sets it to NULL +SELECT boolean('off_') AS error; + +SELECT boolean('1') AS true; + +-- [SPARK-27923] PostgreSQL does not accept '11' but Spark SQL accepts it and sets it to NULL +SELECT boolean('11') AS error; + +SELECT boolean('0') AS false; + +-- [SPARK-27923] PostgreSQL does not accept '000' but Spark SQL accepts it and sets it to NULL +SELECT boolean('000') AS error; + +-- [SPARK-27923] PostgreSQL does not accept '' but Spark SQL accep
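The accepted inputs and the NULL-instead-of-error behavior described above can be observed directly, assuming a `SparkSession` named `spark`:

```scala
// 't', 'true', 'y', 'yes' and '1' map to true; 'f', 'false', 'n', 'no' and '0' to false.
spark.sql("SELECT boolean('t'), boolean('yes'), boolean('1')").show()
// Unrecognized input is not rejected: Spark yields NULL where PostgreSQL raises an error.
spark.sql("SELECT boolean('test')").show()
```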
[spark] branch master updated: [SPARK-27883][SQL] Port AGGREGATES.sql [Part 2]
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4de9649 [SPARK-27883][SQL] Port AGGREGATES.sql [Part 2] 4de9649 is described below commit 4de96493ae1595bf6a80596c99df0e003ef0cf7d Author: Yuming Wang AuthorDate: Thu Jun 6 09:28:59 2019 -0700 [SPARK-27883][SQL] Port AGGREGATES.sql [Part 2] ## What changes were proposed in this pull request? This PR is to port AGGREGATES.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L145-L350 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/aggregates.out#L499-L984 When porting the test cases, found four PostgreSQL specific features that do not exist in Spark SQL: - [SPARK-27877](https://issues.apache.org/jira/browse/SPARK-27877): Implement SQL-standard LATERAL subqueries - [SPARK-27878](https://issues.apache.org/jira/browse/SPARK-27878): Support ARRAY(sub-SELECT) expressions - [SPARK-27879](https://issues.apache.org/jira/browse/SPARK-27879): Implement bitwise integer aggregates(BIT_AND and BIT_OR) - [SPARK-27880](https://issues.apache.org/jira/browse/SPARK-27880): Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY) ## How was this patch tested? N/A Closes #24743 from wangyum/SPARK-27883. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- .../sql-tests/inputs/pgSQL/aggregates_part1.sql| 2 +- .../sql-tests/inputs/pgSQL/aggregates_part2.sql| 228 + .../results/pgSQL/aggregates_part2.sql.out | 162 +++ 3 files changed, 391 insertions(+), 1 deletion(-) diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part1.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part1.sql index de7bbda..a81eca2 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part1.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part1.sql @@ -3,7 +3,7 @@ -- -- -- AGGREGATES [Part 1] --- https://github.com/postgres/postgres/blob/02ddd499322ab6f2f0d58692955dc9633c2150fc/src/test/regress/sql/aggregates.sql#L1-L143 +-- https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L1-L143 -- avoid bit-exact output here because operations may not be bit-exact. -- SET extra_float_digits = 0; diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part2.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part2.sql new file mode 100644 index 000..c461370 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part2.sql @@ -0,0 +1,228 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- AGGREGATES [Part 2] +-- https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L145-L350 + +create temporary view int4_tbl as select * from values + (0), + (123456), + (-123456), + (2147483647), + (-2147483647) + as int4_tbl(f1); + +-- Test handling of Params within aggregate arguments in hashed aggregation. +-- Per bug report from Jeevan Chalke. 
+-- [SPARK-27877] Implement SQL-standard LATERAL subqueries +-- explain (verbose, costs off) +-- select s1, s2, sm +-- from generate_series(1, 3) s1, +-- lateral (select s2, sum(s1 + s2) sm +-- from generate_series(1, 3) s2 group by s2) ss +-- order by 1, 2; +-- select s1, s2, sm +-- from generate_series(1, 3) s1, +-- lateral (select s2, sum(s1 + s2) sm +-- from generate_series(1, 3) s2 group by s2) ss +-- order by 1, 2; + +-- [SPARK-27878] Support ARRAY(sub-SELECT) expressions +-- explain (verbose, costs off) +-- select array(select sum(x+y) s +-- from generate_series(1,3) y group by y order by s) +-- from generate_series(1,3) x; +-- select array(select sum(x+y) s +-- from generate_series(1,3) y group by y order by s) +-- from generate_series(1,3) x; + +-- [SPARK-27879] Implement bitwise integer aggregates(BIT_AND and BIT_OR) +-- +-- test for bitwise integer aggregates +-- +-- CREATE TEMPORARY TABLE bitwise_test( +-- i2 INT2, +-- i4 INT4, +-- i8 INT8, +-- i INTEGER, +-- x INT2, +-- y BIT(4) +-- ); + +-- empty case +-- SELECT +-- BIT_AND(i2) AS "?", +-- BIT_OR(i4) AS "?" +-- FROM bitwise_test; + +-- COPY bitwise_test FROM STDIN NULL 'null'; +-- 1 1 1 1 1 B0101 +-- 3 3 3 null2 B0100 +-- 7 7 7 3 4 B1100 +-- \. + +-- SELECT +-- BIT_AND(i2) AS "1", +-- BIT_AND(i4) AS "1", +-- BIT_AND(i8)
[spark] branch master updated (6c28ef1 -> 5d6758c)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 6c28ef1 [SPARK-27933][SS] Extracting common purge behaviour to the parent StreamExecution add 5d6758c [SPARK-27857][SQL] Move ALTER TABLE parsing into Catalyst No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/parser/SqlBase.g4| 40 +++- .../spark/sql/catalyst/parser/AstBuilder.scala | 162 +++- .../plans/logical/sql/AlterTableStatements.scala | 78 ...ewStatement.scala => AlterViewStatements.scala} | 20 +- .../plans/logical/sql/CreateTableStatement.scala | 10 +- .../plans/logical/sql/ParsedStatement.scala| 5 + .../spark/sql/catalyst/parser/DDLParserSuite.scala | 207 - .../spark/sql/execution/SparkSqlParser.scala | 60 +- .../datasources/DataSourceResolution.scala | 42 - .../sql/execution/command/DDLParserSuite.scala | 63 +-- .../execution/command/PlanResolutionSuite.scala| 76 11 files changed, 612 insertions(+), 151 deletions(-) create mode 100644 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/sql/AlterTableStatements.scala copy sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/sql/{DropViewStatement.scala => AlterViewStatements.scala} (69%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3f102a8 -> 8b6232b)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3f102a8 [SPARK-27749][SQL] hadoop-3.2 support hive-thriftserver add 8b6232b [SPARK-27521][SQL] Move data source v2 to catalyst module No new revisions were added by this update. Summary of changes: project/MimaExcludes.scala | 39 ++ sql/catalyst/pom.xml | 4 +++ .../spark/sql/sources/v2/SessionConfigSupport.java | 0 .../apache/spark/sql/sources/v2/SupportsRead.java | 0 .../apache/spark/sql/sources/v2/SupportsWrite.java | 0 .../apache/spark/sql/sources/v2/TableProvider.java | 9 + .../apache/spark/sql/sources/v2/reader/Batch.java | 0 .../sql/sources/v2/reader/InputPartition.java | 0 .../sql/sources/v2/reader/PartitionReader.java | 0 .../sources/v2/reader/PartitionReaderFactory.java | 0 .../apache/spark/sql/sources/v2/reader/Scan.java | 0 .../spark/sql/sources/v2/reader/ScanBuilder.java | 0 .../spark/sql/sources/v2/reader/Statistics.java| 0 .../sources/v2/reader/SupportsPushDownFilters.java | 0 .../v2/reader/SupportsPushDownRequiredColumns.java | 0 .../v2/reader/SupportsReportPartitioning.java | 0 .../v2/reader/SupportsReportStatistics.java| 0 .../reader/partitioning/ClusteredDistribution.java | 0 .../v2/reader/partitioning/Distribution.java | 0 .../v2/reader/partitioning/Partitioning.java | 0 .../streaming/ContinuousPartitionReader.java | 0 .../ContinuousPartitionReaderFactory.java | 0 .../v2/reader/streaming/ContinuousStream.java | 0 .../v2/reader/streaming/MicroBatchStream.java | 0 .../sql/sources/v2/reader/streaming/Offset.java| 0 .../v2/reader/streaming/PartitionOffset.java | 0 .../v2/reader/streaming/SparkDataStream.java | 0 .../spark/sql/sources/v2/writer/BatchWrite.java| 0 .../spark/sql/sources/v2/writer/DataWriter.java| 0 .../sql/sources/v2/writer/DataWriterFactory.java | 0 .../v2/writer/SupportsDynamicOverwrite.java| 0 .../sql/sources/v2/writer/SupportsOverwrite.java | 0 .../sql/sources/v2/writer/SupportsTruncate.java| 0 .../spark/sql/sources/v2/writer/WriteBuilder.java | 0 .../sql/sources/v2/writer/WriterCommitMessage.java | 0 .../streaming/StreamingDataWriterFactory.java | 0 .../v2/writer/streaming/StreamingWrite.java| 0 .../spark/sql/vectorized/ArrowColumnVector.java| 2 +- .../apache/spark/sql/vectorized/ColumnVector.java | 0 .../apache/spark/sql/vectorized/ColumnarArray.java | 0 .../apache/spark/sql/vectorized/ColumnarBatch.java | 0 .../apache/spark/sql/vectorized/ColumnarMap.java | 0 .../apache/spark/sql/vectorized/ColumnarRow.java | 0 .../org/apache/spark/sql/sources/filters.scala | 0 .../org/apache/spark/sql/util}/ArrowUtils.scala| 2 +- .../apache/spark/sql/util}/ArrowUtilsSuite.scala | 4 +-- sql/core/pom.xml | 4 --- .../sql/execution/arrow/ArrowConverters.scala | 1 + .../spark/sql/execution/arrow/ArrowWriter.scala| 1 + .../execution/python/AggregateInPandasExec.scala | 2 +- .../sql/execution/python/ArrowEvalPythonExec.scala | 2 +- .../sql/execution/python/ArrowPythonRunner.scala | 3 +- .../python/FlatMapGroupsInPandasExec.scala | 4 +-- .../sql/execution/python/WindowInPandasExec.scala | 2 +- .../spark/sql/execution/r/ArrowRRunner.scala | 3 +- .../sql/execution/arrow/ArrowConvertersSuite.scala | 1 + .../sources/RateStreamProviderSuite.scala | 2 +- .../streaming/sources/TextSocketStreamSuite.scala | 2 +- .../vectorized/ArrowColumnVectorSuite.scala| 2 +- .../execution/vectorized/ColumnarBatchSuite.scala | 4 +-- 60 files changed, 65 insertions(+), 28 deletions(-) rename sql/{core => 
catalyst}/src/main/java/org/apache/spark/sql/sources/v2/SessionConfigSupport.java (100%) rename sql/{core => catalyst}/src/main/java/org/apache/spark/sql/sources/v2/SupportsRead.java (100%) rename sql/{core => catalyst}/src/main/java/org/apache/spark/sql/sources/v2/SupportsWrite.java (100%) rename sql/{core => catalyst}/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java (89%) rename sql/{core => catalyst}/src/main/java/org/apache/spark/sql/sources/v2/reader/Batch.java (100%) rename sql/{core => catalyst}/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java (100%) rename sql/{core => catalyst}/src/main/java/org/apache/spark/sql/sources/v2/reader/PartitionReader.java (100%) rename sql/{core => catalyst}/src/main/java/org/apache/spark/sql/sources/v2/reader/PartitionReaderFactory.java (100%) rename sql/{core => catalyst}/src/main/java/org/apache/spark/sql/sources/v2/reader/Scan.java (100%) rename sql/{core => catalyst}/src/main/jav
[spark] branch master updated (b312033 -> 18834e8)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b312033 [SPARK-20286][CORE] Improve logic for timing out executors in dynamic allocation. add 18834e8 [SPARK-27899][SQL] Refactor getTableOption() to extract a common method No new revisions were added by this update. Summary of changes: .../spark/sql/hive/client/HiveClientImpl.scala | 206 +++-- 1 file changed, 104 insertions(+), 102 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [MINOR][DOC] Avro data source documentation change
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 9d307dd [MINOR][DOC] Avro data source documentation change 9d307dd is described below commit 9d307dd53857e3eac9509992505be5ca37f2de7a Author: Jules Damji AuthorDate: Tue Jun 4 16:17:53 2019 -0700 [MINOR][DOC] Avro data source documentation change This is a minor documentation change whereby the https://spark.apache.org/docs/latest/sql-data-sources-avro.html mentions "The date type and naming of record fields should match the input Avro data or Catalyst data," The term Catalyst data is confusing. It should instead say, Spark's internal data type such as String Type or IntegerType. (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) There are no code changes; only doc changes. Please review https://spark.apache.org/contributing.html before opening a pull request. Closes #24787 from dmatrix/br-orc-ds.doc.changes. Authored-by: Jules Damji Signed-off-by: gatorsmile (cherry picked from commit b71abd654de2886ff2b44cada81ea909a0712f7c) Signed-off-by: gatorsmile --- docs/sql-data-sources-avro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md index d3b81f0..285b3aa 100644 --- a/docs/sql-data-sources-avro.md +++ b/docs/sql-data-sources-avro.md @@ -148,8 +148,8 @@ Data source options of Avro can be set using the `.option` method on `DataFrameR avroSchema None -Optional Avro schema provided by an user in JSON format. The date type and naming of record fields -should match the input Avro data or Catalyst data, otherwise the read/write action will fail. +Optional Avro schema provided by a user in JSON format. The data type and naming of record fields +should match the Avro data type when reading from Avro or match the Spark's internal data type (e.g., StringType, IntegerType) when writing to Avro files; otherwise, the read/write action will fail. read and write - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
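A hedged sketch of the `avroSchema` option that the corrected sentence documents; the schema JSON and file path are illustrative, and the external `spark-avro` module must be on the classpath:

```scala
// Supply a user-defined Avro schema in JSON form; field names and types must
// match the Avro data being read (or Spark's internal types when writing).
val userSchema =
  """{
    |  "type": "record",
    |  "name": "User",
    |  "fields": [
    |    {"name": "name", "type": "string"},
    |    {"name": "age",  "type": "int"}
    |  ]
    |}""".stripMargin

val users = spark.read
  .format("avro")
  .option("avroSchema", userSchema)
  .load("/tmp/users.avro")
users.printSchema()
```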
[spark] branch master updated (de73a54 -> b71abd6)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from de73a54 [SPARK-27909][SQL] Do not run analysis inside CTE substitution add b71abd6 [MINOR][DOC] Avro data source documentation change No new revisions were added by this update. Summary of changes: docs/sql-data-sources-avro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-27909][SQL] Do not run analysis inside CTE substitution
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new de73a54 [SPARK-27909][SQL] Do not run analysis inside CTE substitution de73a54 is described below commit de73a54269cd73fba1a9735304f641e1d9c45789 Author: Ryan Blue AuthorDate: Tue Jun 4 14:46:13 2019 -0700 [SPARK-27909][SQL] Do not run analysis inside CTE substitution ## What changes were proposed in this pull request? This updates CTE substitution to avoid needing to run all resolution rules on each substituted expression. Running resolution rules was previously used to avoid infinite recursion. In the updated rule, CTE plans are substituted as sub-queries from right to left. Using this scope-based order, it is not necessary to replace multiple CTEs at the same time using `resolveOperatorsDown`. Instead, `resolveOperatorsUp` is used to replace each CTE individually. By resolving using `resolveOperatorsUp`, this no longer needs to run all analyzer rules on each substituted expression. Previously, this was done to apply `ResolveRelations`, which would throw an `AnalysisException` for all unresolved relations so that unresolved relations that may cause recursive substitutions were not left in the plan. Because this is no longer needed, `ResolveRelations` no longer needs to throw `AnalysisException` and resolution can be done in multiple rules. ## How was this patch tested? Existing tests in `SQLQueryTestSuite`, `cte.sql`. Closes #24763 from rdblue/SPARK-27909-fix-cte-substitution. Authored-by: Ryan Blue Signed-off-by: gatorsmile --- .../spark/sql/catalyst/analysis/Analyzer.scala | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 91365fc..841b858 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -215,23 +215,26 @@ class Analyzer( object CTESubstitution extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp { case With(child, relations) => -substituteCTE(child, relations.foldLeft(Seq.empty[(String, LogicalPlan)]) { - case (resolved, (name, relation)) => -resolved :+ name -> executeSameContext(substituteCTE(relation, resolved)) -}) +// substitute CTE expressions right-to-left to resolve references to previous CTEs: +// with a as (select * from t), b as (select * from a) select * from b +relations.foldRight(child) { + case ((cteName, ctePlan), currentPlan) => +substituteCTE(currentPlan, cteName, ctePlan) +} case other => other } -def substituteCTE(plan: LogicalPlan, cteRelations: Seq[(String, LogicalPlan)]): LogicalPlan = { - plan resolveOperatorsDown { +def substituteCTE(plan: LogicalPlan, cteName: String, ctePlan: LogicalPlan): LogicalPlan = { + plan resolveOperatorsUp { +case UnresolvedRelation(TableIdentifier(table, None)) if resolver(cteName, table) => + ctePlan case u: UnresolvedRelation => - cteRelations.find(x => resolver(x._1, u.tableIdentifier.table)) -.map(_._2).getOrElse(u) + u case other => // This cannot be done in ResolveSubquery because ResolveSubquery does not know the CTE. 
other transformExpressions { case e: SubqueryExpression => - e.withNewPlan(substituteCTE(e.plan, cteRelations)) + e.withNewPlan(substituteCTE(e.plan, cteName, ctePlan)) } } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
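As a small illustration of the plan shape this rule targets, here is a hedged sketch (assuming a SparkSession `spark`): `b` references `a`, so substituting right-to-left replaces the reference inside `b` before `a` itself is inlined, without re-running the full analyzer on each substituted plan.

```scala
// Chained CTEs: b refers to a; CTESubstitution now inlines them right-to-left.
spark.range(10).createOrReplaceTempView("t")
spark.sql(
  """WITH a AS (SELECT * FROM t),
    |     b AS (SELECT * FROM a)
    |SELECT * FROM b""".stripMargin).show()
```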
[spark] branch master updated: [SPARK-27926][SQL] Allow altering table add columns with CSVFileFormat/JsonFileFormat provider
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d1937c1 [SPARK-27926][SQL] Allow altering table add columns with CSVFileFormat/JsonFileFormat provider d1937c1 is described below commit d1937c14795aed4bea89065c3796a5bc1dd18c8f Author: Gengliang Wang AuthorDate: Mon Jun 3 23:51:05 2019 -0700 [SPARK-27926][SQL] Allow altering table add columns with CSVFileFormat/JsonFileFormat provider ## What changes were proposed in this pull request? In the previous work of the CSV/JSON migration, CSVFileFormat/JsonFileFormat was removed from the table provider whitelist of `AlterTableAddColumnsCommand.verifyAlterTableAddColumn`: https://github.com/apache/spark/pull/24005 https://github.com/apache/spark/pull/24058 This is a regression. If a table is created with provider `org.apache.spark.sql.execution.datasources.csv.CSVFileFormat` or `org.apache.spark.sql.execution.datasources.json.JsonFileFormat`, Spark should allow the "alter table add column" operation. ## How was this patch tested? Unit test Closes #24776 from gengliangwang/v1Table. Authored-by: Gengliang Wang Signed-off-by: gatorsmile --- .../main/scala/org/apache/spark/sql/execution/command/tables.scala | 5 - .../test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala | 5 - 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala index ea29eff..64f739f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala @@ -37,6 +37,8 @@ import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.util.{escapeSingleQuotedString, quoteIdentifier} import org.apache.spark.sql.execution.datasources.{DataSource, PartitioningUtils} +import org.apache.spark.sql.execution.datasources.csv.CSVFileFormat +import org.apache.spark.sql.execution.datasources.json.JsonFileFormat import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat import org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2 import org.apache.spark.sql.execution.datasources.v2.json.JsonDataSourceV2 @@ -238,7 +240,8 @@ case class AlterTableAddColumnsCommand( // TextFileFormat only default to one column "value" // Hive type is already considered as hive serde table, so the logic will not // come in here. 
-case _: JsonDataSourceV2 | _: CSVDataSourceV2 | _: ParquetFileFormat | _: OrcDataSourceV2 => +case _: CSVFileFormat | _: JsonFileFormat | _: ParquetFileFormat => +case _: JsonDataSourceV2 | _: CSVDataSourceV2 | _: OrcDataSourceV2 => case s if s.getClass.getCanonicalName.endsWith("OrcFileFormat") => case s => throw new AnalysisException( diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala index 0124f28..b777db7 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala @@ -2566,7 +2566,10 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } - val supportedNativeFileFormatsForAlterTableAddColumns = Seq("parquet", "json", "csv") + val supportedNativeFileFormatsForAlterTableAddColumns = Seq("csv", "json", "parquet", +"org.apache.spark.sql.execution.datasources.csv.CSVFileFormat", +"org.apache.spark.sql.execution.datasources.json.JsonFileFormat", +"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat") supportedNativeFileFormatsForAlterTableAddColumns.foreach { provider => test(s"alter datasource table add columns - $provider") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
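To make the regression concrete, here is a hedged sketch (Scala, assuming a SparkSession `spark`; the table name is hypothetical) of the operation that is allowed again after this change when the table was created with the fully qualified v1 provider class:

```scala
// Create a datasource table using the fully qualified CSVFileFormat provider name,
// then add a column; before this fix the ALTER TABLE step was rejected.
spark.sql(
  """CREATE TABLE t_csv (id INT)
    |USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat""".stripMargin)
spark.sql("ALTER TABLE t_csv ADD COLUMNS (name STRING)")
```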
[spark] branch master updated (a12de29 -> 5e3520f)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from a12de29 [SPARK-27838][SQL] Support user provided non-nullable avro schema for nullable catalyst schema without any null record add 5e3520f [SPARK-27809][SQL] Make optional clauses order insensitive for CREATE DATABASE/VIEW SQL statement No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/parser/SqlBase.g4| 13 ++-- .../sql/catalyst/parser/ParserUtilsSuite.scala | 7 +- .../spark/sql/execution/SparkSqlParser.scala | 76 +- .../sql/execution/command/DDLParserSuite.scala | 38 ++- 4 files changed, 94 insertions(+), 40 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
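A hedged example of what the parser change permits, assuming a SparkSession `spark`; database names and paths are hypothetical. After SPARK-27809 the optional clauses of CREATE DATABASE (and similarly CREATE VIEW) may appear in any order:

```scala
// Both statements parse after this change; previously the clause order was fixed.
spark.sql("CREATE DATABASE IF NOT EXISTS db1 COMMENT 'demo' LOCATION '/tmp/db1'")
spark.sql("CREATE DATABASE IF NOT EXISTS db2 LOCATION '/tmp/db2' COMMENT 'demo'")
```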
[spark] branch master updated: [SPARK-27824][SQL] Make rule EliminateResolvedHint idempotent
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new de13f70 [SPARK-27824][SQL] Make rule EliminateResolvedHint idempotent de13f70 is described below commit de13f70ce1b3d5c90dc9b4f185e50dea3f1508b7 Author: maryannxue AuthorDate: Fri May 24 11:25:22 2019 -0700 [SPARK-27824][SQL] Make rule EliminateResolvedHint idempotent ## What changes were proposed in this pull request? This fix prevents the rule EliminateResolvedHint from being applied again if it's already applied. ## How was this patch tested? Added new UT. Closes #24692 from maryannxue/eliminatehint-bug. Authored-by: maryannxue Signed-off-by: gatorsmile --- .../catalyst/optimizer/EliminateResolvedHint.scala | 2 +- .../scala/org/apache/spark/sql/JoinHintSuite.scala | 23 ++ 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala index aebd660..6419f9c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala @@ -29,7 +29,7 @@ object EliminateResolvedHint extends Rule[LogicalPlan] { // is using transformUp rather than resolveOperators. def apply(plan: LogicalPlan): LogicalPlan = { val pulledUp = plan transformUp { - case j: Join => + case j: Join if j.hint == JoinHint.NONE => val (newLeft, leftHints) = extractHintsFromPlan(j.left) val (newRight, rightHints) = extractHintsFromPlan(j.right) val newJoinHint = JoinHint(mergeHints(leftHints), mergeHints(rightHints)) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala index 9c2dc0c..534560b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala @@ -22,8 +22,10 @@ import scala.collection.mutable.ArrayBuffer import org.apache.log4j.{AppenderSkeleton, Level} import org.apache.log4j.spi.LoggingEvent +import org.apache.spark.sql.catalyst.optimizer.EliminateResolvedHint import org.apache.spark.sql.catalyst.plans.PlanTest import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules.RuleExecutor import org.apache.spark.sql.execution.joins._ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SharedSQLContext @@ -557,4 +559,25 @@ class JoinHintSuite extends PlanTest with SharedSQLContext { } } } + + test("Verify that the EliminatedResolvedHint rule is idempotent") { +withTempView("t1", "t2") { + Seq((1, "4"), (2, "2")).toDF("key", "value").createTempView("t1") + Seq((1, "1"), (2, "12.3"), (2, "123")).toDF("key", "value").createTempView("t2") + val df = sql("SELECT /*+ broadcast(t2) */ * from t1 join t2 ON t1.key = t2.key") + val optimize = new RuleExecutor[LogicalPlan] { +val batches = Batch("EliminateResolvedHint", FixedPoint(10), EliminateResolvedHint) :: Nil + } + val optimized = optimize.execute(df.logicalPlan) + val expectedHints = +JoinHint( + None, + Some(HintInfo(strategy = Some(BROADCAST :: Nil + val joinHints = optimized collect { +case Join(_, _, _, _, hint) => hint +case _: ResolvedHint => fail("ResolvedHint should 
not appear after optimize.") + } + assert(joinHints == expectedHints) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-26356][SQL] remove SaveMode from data source v2
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7d318bf [SPARK-26356][SQL] remove SaveMode from data source v2 7d318bf is described below commit 7d318bfe907ab22b904d118e4ff4970af32b0e44 Author: Wenchen Fan AuthorDate: Fri May 24 10:45:46 2019 -0700 [SPARK-26356][SQL] remove SaveMode from data source v2 ## What changes were proposed in this pull request? In data source v1, save mode specified in `DataFrameWriter` is passed to data source implementation directly, and each data source can define its own behavior about save mode. This is confusing and we want to get rid of save mode in data source v2. For data source v2, we expect data source to implement the `TableCatalog` API, and end-users use SQL(or the new write API described in [this doc](https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5ace0718#heading=h.e9v1af12g5zo)) to acess data sources. The SQL API has very clear semantic and we don't need save mode at all. However, for simple data sources that do not have table management (like a JIRA data source, a noop sink, etc.), it's not ideal to ask them to implement the `TableCatalog` API, and throw exception here and there. `TableProvider` API is created for simple data sources. It can only get tables, without any other table management methods. This means, it can only deal with existing tables. `TableProvider` fits well with `DataStreamReader` and `DataStreamWriter`, as they can only read/write existing tables. However, `TableProvider` doesn't fit `DataFrameWriter` well, as the save mode requires more than just get table. More specifically, `ErrorIfExists` mode needs to check if table exists, and create table. `Ignore` mode needs to check if table exists. When end-users specify `ErrorIfExists` or `Ignore` mode and write data to `TableProvider` via `DataFrameWriter`, Spark fa [...] The file source is in the middle of `TableProvider` and `TableCatalog`: it's simple but it can check table(path) exists and create table(path). That said, file source supports all the save modes. Currently file source implements `TableProvider`, and it's not working because `TableProvider` doesn't support `ErrorIfExists` and `Ignore` modes. Ideally we should create a new API for path-based data sources, but to unblock the work of file source v2 migration, this PR proposes to special-case file source v2 in `DataFrameWriter`, to make it work. This PR also removes `SaveMode` from data source v2, as now only the internal file source v2 needs it. ## How was this patch tested? existing tests Closes #24233 from cloud-fan/file. 
Authored-by: Wenchen Fan Signed-off-by: gatorsmile --- .../apache/spark/sql/sources/v2/TableProvider.java | 4 ++ .../sql/sources/v2/writer/SupportsSaveMode.java| 26 .../spark/sql/sources/v2/writer/WriteBuilder.java | 4 -- .../org/apache/spark/sql/DataFrameWriter.scala | 77 -- .../datasources/noop/NoopDataSource.scala | 6 +- .../datasources/v2/FileWriteBuilder.scala | 7 +- .../datasources/v2/WriteToDataSourceV2Exec.scala | 35 +++--- .../spark/sql/sources/v2/DataSourceV2Suite.scala | 44 - .../sources/v2/FileDataSourceV2FallBackSuite.scala | 4 +- .../sql/sources/v2/SimpleWritableDataSource.scala | 23 ++- .../sql/test/DataFrameReaderWriterSuite.scala | 72 ++-- 11 files changed, 147 insertions(+), 155 deletions(-) diff --git a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java index 04ad8fd..0e2eb9c 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java +++ b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java @@ -26,6 +26,10 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap; * The base interface for v2 data sources which don't have a real catalog. Implementations must * have a public, 0-arg constructor. * + * Note that, TableProvider can only apply data operations to existing tables, like read, append, + * delete, and overwrite. It does not support the operations that require metadata changes, like + * create/drop tables. + * * The major responsibility of this interface is to return a {@link Table} for read/write. * */ diff --git a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/SupportsSaveMode.java b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/SupportsSaveMode.java deleted file mode 100644 index c4295f2..000 --- a/sql/core/src/main/java/org
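For context, a minimal sketch of the v1 `DataFrameWriter` save modes discussed in the commit message (the output path is hypothetical); it is exactly this behavior, an existence check plus optional table/path creation, that a plain `TableProvider` cannot express:

```scala
import org.apache.spark.sql.SaveMode

// ErrorIfExists: fail if the target path already exists.
spark.range(5).write.mode(SaveMode.ErrorIfExists).parquet("/tmp/out")
// Ignore: silently skip the write if the target path already exists.
spark.range(5).write.mode(SaveMode.Ignore).parquet("/tmp/out")
```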
[spark] branch master updated: [SPARK-27816][SQL] make TreeNode tag type safe
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1a68fc3 [SPARK-27816][SQL] make TreeNode tag type safe 1a68fc3 is described below commit 1a68fc38f0aafb9015c499b3f9f7fbe63739e909 Author: Wenchen Fan AuthorDate: Thu May 23 11:53:21 2019 -0700 [SPARK-27816][SQL] make TreeNode tag type safe ## What changes were proposed in this pull request? Add type parameter to `TreeNodeTag`. ## How was this patch tested? existing tests Closes #24687 from cloud-fan/tag. Authored-by: Wenchen Fan Signed-off-by: gatorsmile --- .../plans/logical/basicLogicalOperators.scala | 2 +- .../apache/spark/sql/catalyst/trees/TreeNode.scala | 21 - .../spark/sql/catalyst/trees/TreeNodeSuite.scala| 18 ++ .../org/apache/spark/sql/execution/SparkPlan.scala | 5 +++-- .../spark/sql/execution/SparkStrategies.scala | 2 +- .../execution/LogicalPlanTagInSparkPlanSuite.scala | 11 +-- 6 files changed, 36 insertions(+), 23 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala index a2a7eb1..4350f91 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala @@ -1083,7 +1083,7 @@ case class OneRowRelation() extends LeafNode { /** [[org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy()]] does not support 0-arg ctor. */ override def makeCopy(newArgs: Array[AnyRef]): OneRowRelation = { val newCopy = OneRowRelation() -newCopy.tags ++= this.tags +newCopy.copyTagsFrom(this) newCopy } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala index a5705d0..cd5dfb7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala @@ -74,9 +74,8 @@ object CurrentOrigin { } } -// The name of the tree node tag. This is preferred over using string directly, as we can easily -// find all the defined tags. -case class TreeNodeTagName(name: String) +// A tag of a `TreeNode`, which defines name and type +case class TreeNodeTag[T](name: String) // scalastyle:off abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { @@ -89,7 +88,19 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { * A mutable map for holding auxiliary information of this tree node. It will be carried over * when this node is copied via `makeCopy`, or transformed via `transformUp`/`transformDown`. */ - val tags: mutable.Map[TreeNodeTagName, Any] = mutable.Map.empty + private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty + + protected def copyTagsFrom(other: BaseType): Unit = { +tags ++= other.tags + } + + def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = { +tags(tag) = value + } + + def getTagValue[T](tag: TreeNodeTag[T]): Option[T] = { +tags.get(tag).map(_.asInstanceOf[T]) + } /** * Returns a Seq of the children of this node. 
@@ -418,7 +429,7 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { try { CurrentOrigin.withOrigin(origin) { val res = defaultCtor.newInstance(allArgs.toArray: _*).asInstanceOf[BaseType] -res.tags ++= this.tags +res.copyTagsFrom(this) res } } catch { diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala index 5cfa84d..744d522 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala @@ -622,31 +622,33 @@ class TreeNodeSuite extends SparkFunSuite with SQLHelper { } test("tags will be carried over after copy & transform") { +val tag = TreeNodeTag[String]("test") + withClue("makeCopy") { val node = Dummy(None) - node.tags += TreeNodeTagName("test") -> "a" + node.setTagValue(tag, "a") val copied = node.makeCopy(Array(Some(Literal(1 - assert(copied.tags(TreeNodeTagName("test")) == "a") + assert(copied.getTagValue(tag) == S
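A hedged sketch of the typed-tag API introduced above (method and class names are taken from the diff; the tag name is hypothetical). It assumes a SparkSession `spark` and uses a logical plan only because `LogicalPlan` is a `TreeNode`; the type parameter lets callers read the tag back without casting:

```scala
import org.apache.spark.sql.catalyst.trees.TreeNodeTag

val noteTag = TreeNodeTag[String]("demo.note")        // hypothetical tag name
val plan = spark.range(1).queryExecution.logical      // any TreeNode works
plan.setTagValue(noteTag, "hello")
val note: Option[String] = plan.getTagValue(noteTag)  // Some("hello"), no asInstanceOf needed
```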
[spark] branch master updated: [SPARK-27439][SQL] Explaining Dataset should show correct resolved plans
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c033a3e [SPARK-27439][SQL] Explainging Dataset should show correct resolved plans c033a3e is described below commit c033a3e1e60a785a42d48c60dd5d72dcb7b0f7c9 Author: Liang-Chi Hsieh AuthorDate: Tue May 21 11:27:05 2019 -0700 [SPARK-27439][SQL] Explainging Dataset should show correct resolved plans ## What changes were proposed in this pull request? Because a temporary view is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Now the explain result of a dataset is not correctly consistent with the collected result of it, because we use pre-analyzed logical plan of the dataset in explain command. The explain command will analyzed the logical plan passed in. So if a view is changed after the dataset was created, the plans shown by explain [...] ```scala scala> spark.range(10).createOrReplaceTempView("test") scala> spark.range(5).createOrReplaceTempView("test2") scala> spark.sql("select * from test").createOrReplaceTempView("tmp001") scala> val df = spark.sql("select * from tmp001") scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001") scala> df.show +---+ | id| +---+ | 0| | 1| | 2| | 3| | 4| | 5| | 6| | 7| | 8| | 9| +---+ scala> df.explain(true) ``` Before: ```scala == Parsed Logical Plan == 'Project [*] +- 'UnresolvedRelation `tmp001` == Analyzed Logical Plan == id: bigint Project [id#2L] +- SubqueryAlias `tmp001` +- Project [id#2L] +- SubqueryAlias `test2` +- Range (0, 5, step=1, splits=Some(12)) == Optimized Logical Plan == Range (0, 5, step=1, splits=Some(12)) == Physical Plan == *(1) Range (0, 5, step=1, splits=12) ``` After: ```scala == Parsed Logical Plan == 'Project [*] +- 'UnresolvedRelation `tmp001` == Analyzed Logical Plan == id: bigint Project [id#0L] +- SubqueryAlias `tmp001` +- Project [id#0L] +- SubqueryAlias `test` +- Range (0, 10, step=1, splits=Some(12)) == Optimized Logical Plan == Range (0, 10, step=1, splits=Some(12)) == Physical Plan == *(1) Range (0, 10, step=1, splits=12) ``` Previous PR to this issue has a regression when to explain an explain statement, like `sql("explain select 1").explain(true)`. This new fix is following up with hvanhovell's advice at https://github.com/apache/spark/pull/24464#issuecomment-494165538. 
Explain an explain: ```scala scala> sql("explain select 1").explain(true) == Parsed Logical Plan == ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false == Analyzed Logical Plan == plan: string ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false == Optimized Logical Plan == ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false == Physical Plan == Execute ExplainCommand +- ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false ``` Btw, I found there is a regression after applying hvanhovell's advice: ```scala spark.readStream .format("org.apache.spark.sql.streaming.test") .load() .explain(true) ``` ```scala == Parsed Logical Plan == StreamingRelation DataSource(org.apache.spark.sql.test.TestSparkSession3e8c7175,org.apache.spark.sql.streaming.test,List(),None,List(),None,Map(),None ), dummySource, [a#559] == Analyzed Logical Plan == a: int StreamingRelation DataSource(org.apache.spark.sql.test.TestSparkSession3e8c7175,org.apache.spark.sql.streaming.test,List(),None,List(),None,Map(),Non$ ), dummySource, [a#559] == Optimized Logical Plan == org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; dummySource == Physical Plan == org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; dummySource ``` So I did a change to that to fix it too. ## How was this patch tested? Added test and manually test. Closes #24654 from viirya/SPARK-27439-3. Authored-by: Liang-Chi Hsieh Signed-off-by: gatorsmile --- .../main/scala/org/apache/spark/s
[spark] branch master updated: [SPARK-27747][SQL] add a logical plan link in the physical plan
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0e6601a [SPARK-27747][SQL] add a logical plan link in the physical plan 0e6601a is described below commit 0e6601acdf17c770f880fbc263747779739f4c92 Author: Wenchen Fan AuthorDate: Mon May 20 13:42:25 2019 -0700 [SPARK-27747][SQL] add a logical plan link in the physical plan ## What changes were proposed in this pull request? It's pretty useful if we can convert a physical plan back to a logical plan, e.g., in https://github.com/apache/spark/pull/24389 This PR introduces a new feature to `TreeNode`, which allows `TreeNode` to carry some extra information via a mutable map, and keep the information when it's copied. The planner leverages this feature to put the logical plan into the physical plan. ## How was this patch tested? a test suite that runs all TPCDS queries and checks that some common physical plans contain the corresponding logical plans. Closes #24626 from cloud-fan/link. Lead-authored-by: Wenchen Fan Co-authored-by: Peng Bo Signed-off-by: gatorsmile --- .../plans/logical/basicLogicalOperators.scala | 6 +- .../apache/spark/sql/catalyst/trees/TreeNode.scala | 25 +++- .../spark/sql/catalyst/trees/TreeNodeSuite.scala | 51 .../org/apache/spark/sql/execution/SparkPlan.scala | 9 +- .../spark/sql/execution/SparkStrategies.scala | 11 ++ .../execution/LogicalPlanTagInSparkPlanSuite.scala | 133 + 6 files changed, 228 insertions(+), 7 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala index 1925d45..a2a7eb1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala @@ -1081,7 +1081,11 @@ case class OneRowRelation() extends LeafNode { override def computeStats(): Statistics = Statistics(sizeInBytes = 1) /** [[org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy()]] does not support 0-arg ctor. */ - override def makeCopy(newArgs: Array[AnyRef]): OneRowRelation = OneRowRelation() + override def makeCopy(newArgs: Array[AnyRef]): OneRowRelation = { +val newCopy = OneRowRelation() +newCopy.tags ++= this.tags +newCopy + } } /** A logical plan for `dropDuplicates`. 
*/ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala index 84ca066..a5705d0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala @@ -19,7 +19,7 @@ package org.apache.spark.sql.catalyst.trees import java.util.UUID -import scala.collection.Map +import scala.collection.{mutable, Map} import scala.reflect.ClassTag import org.apache.commons.lang3.ClassUtils @@ -35,7 +35,7 @@ import org.apache.spark.sql.catalyst.errors._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.JoinType import org.apache.spark.sql.catalyst.plans.physical.{BroadcastMode, Partitioning} -import org.apache.spark.sql.catalyst.util.StringUtils.{PlanStringConcat, StringConcat} +import org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat import org.apache.spark.sql.catalyst.util.truncatedString import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ @@ -74,6 +74,10 @@ object CurrentOrigin { } } +// The name of the tree node tag. This is preferred over using string directly, as we can easily +// find all the defined tags. +case class TreeNodeTagName(name: String) + // scalastyle:off abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { // scalastyle:on @@ -82,6 +86,12 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { val origin: Origin = CurrentOrigin.get /** + * A mutable map for holding auxiliary information of this tree node. It will be carried over + * when this node is copied via `makeCopy`, or transformed via `transformUp`/`transformDown`. + */ + val tags: mutable.Map[TreeNodeTagName, Any] = mutable.Map.empty + + /** * Returns a Seq of the children of this node. * Children should not change. Immutability required for containsChild optimization */ @@ -262,6 +272,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { if (this fastEquals afterRule) {
[spark] branch master updated: [SPARK-27674][SQL] the hint should not be dropped after cache lookup
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3e30a98 [SPARK-27674][SQL] the hint should not be dropped after cache lookup 3e30a98 is described below commit 3e30a988102e162f2702ae223312763a0bdc15eb Author: Wenchen Fan AuthorDate: Wed May 15 15:47:52 2019 -0700 [SPARK-27674][SQL] the hint should not be dropped after cache lookup ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/20365 . #20365 fixed this problem when the hint node is a root node. This PR fixes this problem for all the cases. ## How was this patch tested? a new test Closes #24580 from cloud-fan/bug. Authored-by: Wenchen Fan Signed-off-by: gatorsmile --- .../catalyst/optimizer/EliminateResolvedHint.scala | 2 +- .../apache/spark/sql/execution/CacheManager.scala | 22 + .../org/apache/spark/sql/CachedTableSuite.scala| 56 -- 3 files changed, 54 insertions(+), 26 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala index 5586690..aebd660 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala @@ -56,7 +56,7 @@ object EliminateResolvedHint extends Rule[LogicalPlan] { * in this method will be cleaned up later by this rule, and may emit warnings depending on the * configurations. */ - private def extractHintsFromPlan(plan: LogicalPlan): (LogicalPlan, Seq[HintInfo]) = { + private[sql] def extractHintsFromPlan(plan: LogicalPlan): (LogicalPlan, Seq[HintInfo]) = { plan match { case h: ResolvedHint => val (plan, hints) = extractHintsFromPlan(h.child) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala index b3c253b..a13e6ad 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala @@ -23,7 +23,8 @@ import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.spark.internal.Logging import org.apache.spark.sql.{Dataset, SparkSession} -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, SubqueryExpression} +import org.apache.spark.sql.catalyst.expressions.{Attribute, SubqueryExpression} +import org.apache.spark.sql.catalyst.optimizer.EliminateResolvedHint import org.apache.spark.sql.catalyst.plans.logical.{IgnoreCachedData, LogicalPlan, ResolvedHint} import org.apache.spark.sql.execution.columnar.InMemoryRelation import org.apache.spark.sql.execution.command.CommandUtils @@ -212,17 +213,18 @@ class CacheManager extends Logging { def useCachedData(plan: LogicalPlan): LogicalPlan = { val newPlan = plan transformDown { case command: IgnoreCachedData => command - // Do not lookup the cache by hint node. Hint node is special, we should ignore it when - // canonicalizing plans, so that plans which are same except hint can hit the same cache. - // However, we also want to keep the hint info after cache lookup. 
Here we skip the hint - // node, so that the returned caching plan won't replace the hint node and drop the hint info - // from the original plan. - case hint: ResolvedHint => hint case currentFragment => -lookupCachedData(currentFragment) - .map(_.cachedRepresentation.withOutput(currentFragment.output)) - .getOrElse(currentFragment) +lookupCachedData(currentFragment).map { cached => + // After cache lookup, we should still keep the hints from the input plan. + val hints = EliminateResolvedHint.extractHintsFromPlan(currentFragment)._2 + val cachedPlan = cached.cachedRepresentation.withOutput(currentFragment.output) + // The returned hint list is in top-down order, we should create the hint nodes from + // right to left. + hints.foldRight[LogicalPlan](cachedPlan) { case (hint, p) => +ResolvedHint(p, hint) + } +}.getOrElse(currentFragment) } newPlan transformAllExpressions { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala index 76350ad..62e77bf 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala +++ b/sql/core/src/test/sca
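A hedged illustration (Scala, assuming a SparkSession `spark`) of the scenario this fix addresses: a broadcast hint placed above a plan fragment that is later replaced by its cached representation should still influence the join strategy.

```scala
import org.apache.spark.sql.functions.broadcast

val t1 = spark.range(100).toDF("id")
val t2 = spark.range(100).toDF("id")
t1.cache()
// The hint sits on top of t1; when cache lookup substitutes t1's InMemoryRelation,
// the extracted hint is re-attached instead of being dropped.
val joined = t2.join(broadcast(t1), "id")
joined.explain()
```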
[spark] branch master updated: [SPARK-27354][SQL] Move incompatible code from the hive-thriftserver module to sql/hive-thriftserver/v1.2.1
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 02c3369 [SPARK-27354][SQL] Move incompatible code from the hive-thriftserver module to sql/hive-thriftserver/v1.2.1 02c3369 is described below commit 02c33694c8254f69cb36c71c0876194dccdbc014 Author: Yuming Wang AuthorDate: Wed May 15 14:52:08 2019 -0700 [SPARK-27354][SQL] Move incompatible code from the hive-thriftserver module to sql/hive-thriftserver/v1.2.1 ## What changes were proposed in this pull request? When we upgraded the built-in Hive to 2.3.4, the current `hive-thriftserver` module is not compatible, such as these Hive changes: 1. [HIVE-12442](https://issues.apache.org/jira/browse/HIVE-12442) HiveServer2: Refactor/repackage HiveServer2's Thrift code so that it can be used in the tasks 2. [HIVE-12237](https://issues.apache.org/jira/browse/HIVE-12237) Use slf4j as logging facade 3. [HIVE-13169](https://issues.apache.org/jira/browse/HIVE-13169) HiveServer2: Support delegation token based connection when using http transport So this PR moves the incompatible code to `sql/hive-thriftserver/v1.2.1` and copies it to `sql/hive-thriftserver/v2.3.4` for the next code review. ## How was this patch tested? manual tests: ``` diff -urNa sql/hive-thriftserver/v1.2.1 sql/hive-thriftserver/v2.3.4 ``` Closes #24282 from wangyum/SPARK-27354. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- sql/hive-thriftserver/pom.xml | 4 +++- sql/hive-thriftserver/{ => v1.2.1}/if/TCLIService.thrift | 0 .../gen/java/org/apache/hive/service/cli/thrift/TArrayTypeEntry.java | 0 .../gen/java/org/apache/hive/service/cli/thrift/TBinaryColumn.java| 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TBoolColumn.java | 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TBoolValue.java | 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TByteColumn.java | 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TByteValue.java | 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TCLIService.java | 0 .../java/org/apache/hive/service/cli/thrift/TCLIServiceConstants.java | 0 .../org/apache/hive/service/cli/thrift/TCancelDelegationTokenReq.java | 0 .../apache/hive/service/cli/thrift/TCancelDelegationTokenResp.java| 0 .../java/org/apache/hive/service/cli/thrift/TCancelOperationReq.java | 0 .../java/org/apache/hive/service/cli/thrift/TCancelOperationResp.java | 0 .../java/org/apache/hive/service/cli/thrift/TCloseOperationReq.java | 0 .../java/org/apache/hive/service/cli/thrift/TCloseOperationResp.java | 0 .../gen/java/org/apache/hive/service/cli/thrift/TCloseSessionReq.java | 0 .../java/org/apache/hive/service/cli/thrift/TCloseSessionResp.java| 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TColumn.java | 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TColumnDesc.java | 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TColumnValue.java | 0 .../gen/java/org/apache/hive/service/cli/thrift/TDoubleColumn.java| 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TDoubleValue.java | 0 .../java/org/apache/hive/service/cli/thrift/TExecuteStatementReq.java | 0 .../org/apache/hive/service/cli/thrift/TExecuteStatementResp.java | 0 .../java/org/apache/hive/service/cli/thrift/TFetchOrientation.java| 0 .../gen/java/org/apache/hive/service/cli/thrift/TFetchResultsReq.java | 0 .../java/org/apache/hive/service/cli/thrift/TFetchResultsResp.java| 0 
.../gen/java/org/apache/hive/service/cli/thrift/TGetCatalogsReq.java | 0 .../gen/java/org/apache/hive/service/cli/thrift/TGetCatalogsResp.java | 0 .../gen/java/org/apache/hive/service/cli/thrift/TGetColumnsReq.java | 0 .../gen/java/org/apache/hive/service/cli/thrift/TGetColumnsResp.java | 0 .../org/apache/hive/service/cli/thrift/TGetDelegationTokenReq.java| 0 .../org/apache/hive/service/cli/thrift/TGetDelegationTokenResp.java | 0 .../gen/java/org/apache/hive/service/cli/thrift/TGetFunctionsReq.java | 0 .../java/org/apache/hive/service/cli/thrift/TGetFunctionsResp.java| 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TGetInfoReq.java | 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TGetInfoResp.java | 0 .../src/gen/java/org/apache/hive/service/cli/thrift/TGetInfoType.java | 0 .../gen/java/org/apache/hive/service/cli/thrift/TGetInfoValue.java| 0 .../org/apache/hive/service/cli/thrift/TGetOperationStatusReq.java| 0 .../org/apache/hive/service/cli/thrift/TGetOperationStatusResp.java | 0 .../org/apache/hive/service/cli/thrift/TGetResultSetMetadataReq.java | 0 .../org/apache/hive/service/cli/thrift/TGetResultSetMetadataResp.java | 0 .../gen/java/or
[spark] branch master updated: [SPARK-20774][SPARK-27036][SQL] Cancel the running broadcast execution on BroadcastTimeout
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0bba5cf [SPARK-20774][SPARK-27036][SQL] Cancel the running broadcast execution on BroadcastTimeout 0bba5cf is described below commit 0bba5cf56832f0690a4ebd733d01a0416e4c7252 Author: Xingbo Jiang AuthorDate: Wed May 15 14:47:15 2019 -0700 [SPARK-20774][SPARK-27036][SQL] Cancel the running broadcast execution on BroadcastTimeout ## What changes were proposed in this pull request? In the existing code, a broadcast execution timeout for the Future only causes a query failure, but the job running with the broadcast and the computation in the Future are not canceled. This wastes resources and slows down the other jobs. This PR tries to cancel both the running job and the running hashed relation construction thread. ## How was this patch tested? Add new test suite `BroadcastExchangeExec` Closes #24595 from jiangxb1987/SPARK-20774. Authored-by: Xingbo Jiang Signed-off-by: gatorsmile --- .../org/apache/spark/sql/internal/SQLConf.scala| 5 +- .../execution/exchange/BroadcastExchangeExec.scala | 151 +++-- .../sql/execution/BroadcastExchangeSuite.scala | 93 + 3 files changed, 176 insertions(+), 73 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index b5e40dd..b4c68a7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -2026,7 +2026,10 @@ class SQLConf extends Serializable with Logging { def columnNameOfCorruptRecord: String = getConf(COLUMN_NAME_OF_CORRUPT_RECORD) - def broadcastTimeout: Long = getConf(BROADCAST_TIMEOUT) + def broadcastTimeout: Long = { +val timeoutValue = getConf(BROADCAST_TIMEOUT) +if (timeoutValue < 0) Long.MaxValue else timeoutValue + } def defaultDataSourceName: String = getConf(DEFAULT_DATA_SOURCE_NAME) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala index aa0dd1d..8017188 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala @@ -17,10 +17,11 @@ package org.apache.spark.sql.execution.exchange -import java.util.concurrent.TimeoutException +import java.util.UUID +import java.util.concurrent._ -import scala.concurrent.{ExecutionContext, Future} -import scala.concurrent.duration._ +import scala.concurrent.ExecutionContext +import scala.concurrent.duration.NANOSECONDS import scala.util.control.NonFatal import org.apache.spark.{broadcast, SparkException} @@ -43,6 +44,8 @@ case class BroadcastExchangeExec( mode: BroadcastMode, child: SparkPlan) extends Exchange { + private val runId: UUID = UUID.randomUUID + override lazy val metrics = Map( "dataSize" -> SQLMetrics.createSizeMetric(sparkContext, "data size"), "collectTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to collect"), @@ -56,79 +59,79 @@ case class BroadcastExchangeExec( } @transient - private val timeout: Duration = { -val timeoutValue = sqlContext.conf.broadcastTimeout -if (timeoutValue < 0) { - Duration.Inf -} else { - timeoutValue.seconds -} - } + 
private val timeout: Long = SQLConf.get.broadcastTimeout @transient private lazy val relationFuture: Future[broadcast.Broadcast[Any]] = { -// broadcastFuture is used in "doExecute". Therefore we can get the execution id correctly here. +// relationFuture is used in "doExecute". Therefore we can get the execution id correctly here. val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) -Future { - // This will run in another thread. Set the execution id so that we can connect these jobs - // with the correct execution. - SQLExecution.withExecutionId(sqlContext.sparkSession, executionId) { -try { - val beforeCollect = System.nanoTime() - // Use executeCollect/executeCollectIterator to avoid conversion to Scala types - val (numRows, input) = child.executeCollectIterator() - if (numRows >= 51200) { -throw new SparkException( - s"Cannot broadcast the table with 512 million or more rows: $numRows rows") - } - - val beforeBuild = System.nanoTime() -
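As a small usage note on the configuration this patch touches (assuming a SparkSession `spark`): `spark.sql.broadcastTimeout` is the number of seconds the driver waits for a broadcast to complete, and after this change a negative value is interpreted as waiting indefinitely.

```scala
// Hedged example of tuning the broadcast timeout.
spark.conf.set("spark.sql.broadcastTimeout", "300")  // wait up to 300 seconds (the default)
spark.conf.set("spark.sql.broadcastTimeout", "-1")   // effectively no timeout
```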
[spark] branch master updated: [SPARK-27402][SQL][TEST-HADOOP3.2][TEST-MAVEN] Fix hadoop-3.2 test issue(except the hive-thriftserver module)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f3ddd6f [SPARK-27402][SQL][TEST-HADOOP3.2][TEST-MAVEN] Fix hadoop-3.2 test issue(except the hive-thriftserver module) f3ddd6f is described below commit f3ddd6f9da27925607c06e55cdfb9a809633238b Author: Yuming Wang AuthorDate: Mon May 13 10:35:26 2019 -0700 [SPARK-27402][SQL][TEST-HADOOP3.2][TEST-MAVEN] Fix hadoop-3.2 test issue(except the hive-thriftserver module) ## What changes were proposed in this pull request? This pr fix hadoop-3.2 test issues(except the `hive-thriftserver` module): 1. Add `hive.metastore.schema.verification` and `datanucleus.schema.autoCreateAll` to HiveConf. 2. hadoop-3.2 support access the Hive metastore from 0.12 to 2.2 After [SPARK-27176](https://issues.apache.org/jira/browse/SPARK-27176) and this PR, we upgraded the built-in Hive to 2.3 when enabling the Hadoop 3.2+ profile. This upgrade fixes the following issues: - [HIVE-6727](https://issues.apache.org/jira/browse/HIVE-6727): Table level stats for external tables are set incorrectly. - [HIVE-15653](https://issues.apache.org/jira/browse/HIVE-15653): Some ALTER TABLE commands drop table stats. - [SPARK-12014](https://issues.apache.org/jira/browse/SPARK-12014): Spark SQL query containing semicolon is broken in Beeline. - [SPARK-25193](https://issues.apache.org/jira/browse/SPARK-25193): insert overwrite doesn't throw exception when drop old data fails. - [SPARK-25919](https://issues.apache.org/jira/browse/SPARK-25919): Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned. - [SPARK-26332](https://issues.apache.org/jira/browse/SPARK-26332): Spark sql write orc table on viewFS throws exception. - [SPARK-26437](https://issues.apache.org/jira/browse/SPARK-26437): Decimal data becomes bigint to query, unable to query. ## How was this patch tested? This pr test Spark’s Hadoop 3.2 profile on jenkins and #24591 test Spark’s Hadoop 2.7 profile on jenkins This PR close #24591 Closes #24391 from wangyum/SPARK-27402. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- dev/sparktestsupport/modules.py| 14 ++- .../spark/sql/execution/command/DDLSuite.scala | 2 +- .../execution/datasources/orc/OrcSourceSuite.scala | 4 +- .../spark/sql/hive/client/HiveClientImpl.scala | 25 +++- .../sql/hive/client/IsolatedClientLoader.scala | 2 + .../org/apache/spark/sql/hive/test/TestHive.scala | 9 - sql/hive/src/test/resources/hive-contrib-2.3.4.jar | Bin 0 -> 125719 bytes .../test/resources/hive-hcatalog-core-2.3.4.jar| Bin 0 -> 263795 bytes .../sql/hive/ClasspathDependenciesSuite.scala | 21 -- .../hive/HiveExternalCatalogVersionsSuite.scala| 4 ++ .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 17 ++-- .../org/apache/spark/sql/hive/HiveShimSuite.scala | 12 +- .../apache/spark/sql/hive/StatisticsSuite.scala| 34 +++- .../spark/sql/hive/orc/HiveOrcFilterSuite.scala| 43 ++--- .../spark/sql/hive/orc/HiveOrcSourceSuite.scala| 9 - 15 files changed, 154 insertions(+), 42 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index d496eec..812c2ef 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -15,9 +15,17 @@ # limitations under the License. 
# +from __future__ import print_function from functools import total_ordering import itertools import re +import os + +if os.environ.get("AMPLAB_JENKINS"): +hadoop_version = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE", "hadoop2.7") +else: +hadoop_version = os.environ.get("HADOOP_PROFILE", "hadoop2.7") +print("[info] Choosing supported modules with Hadoop profile", hadoop_version) all_modules = [] @@ -72,7 +80,11 @@ class Module(object): self.dependent_modules = set() for dep in dependencies: dep.dependent_modules.add(self) -all_modules.append(self) +# TODO: Skip hive-thriftserver module for hadoop-3.2. remove this once hadoop-3.2 support it +if name == "hive-thriftserver" and hadoop_version == "hadoop3.2": +print("[info] Skip unsupported module:", name) +else: +all_modules.append(self) def contains_file(self, filename): return any(re.match(p, filename) for p in self.source_file_prefixes) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/com
[spark] branch master updated: [SPARK-27642][SS] make v1 offset extends v2 offset
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bae5baa [SPARK-27642][SS] make v1 offset extends v2 offset bae5baa is described below commit bae5baae5281d01dc8c67077b90592be857329bd Author: Wenchen Fan AuthorDate: Tue May 7 23:03:15 2019 -0700 [SPARK-27642][SS] make v1 offset extends v2 offset ## What changes were proposed in this pull request? To move DS v2 to the catalyst module, we can't make v2 offset rely on v1 offset, as v1 offset is in sql/core. ## How was this patch tested? existing tests Closes #24538 from cloud-fan/offset. Authored-by: Wenchen Fan Signed-off-by: gatorsmile --- .../spark/sql/kafka010/KafkaContinuousStream.scala | 2 +- .../spark/sql/kafka010/KafkaSourceOffset.scala | 4 +-- .../spark/sql/execution/streaming/Offset.java | 42 +++--- .../sql/sources/v2/reader/streaming/Offset.java| 11 ++ .../spark/sql/execution/streaming/LongOffset.scala | 14 +--- .../execution/streaming/MicroBatchExecution.scala | 10 +++--- .../spark/sql/execution/streaming/OffsetSeq.scala | 9 ++--- .../sql/execution/streaming/OffsetSeqLog.scala | 3 +- .../sql/execution/streaming/StreamExecution.scala | 4 +-- .../sql/execution/streaming/StreamProgress.scala | 19 +- .../spark/sql/execution/streaming/memory.scala | 25 + .../sources/TextSocketMicroBatchStream.scala | 5 +-- .../apache/spark/sql/streaming/StreamTest.scala| 8 ++--- 13 files changed, 49 insertions(+), 107 deletions(-) diff --git a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala index d60ee1c..92686d2 100644 --- a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala +++ b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousStream.scala @@ -76,7 +76,7 @@ class KafkaContinuousStream( } override def planInputPartitions(start: Offset): Array[InputPartition] = { -val oldStartPartitionOffsets = KafkaSourceOffset.getPartitionOffsets(start) +val oldStartPartitionOffsets = start.asInstanceOf[KafkaSourceOffset].partitionToOffsets val currentPartitionSet = offsetReader.fetchEarliestOffsets().keySet val newPartitions = currentPartitionSet.diff(oldStartPartitionOffsets.keySet) diff --git a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala index 8d41c0d..90d7043 100644 --- a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala +++ b/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala @@ -20,14 +20,14 @@ package org.apache.spark.sql.kafka010 import org.apache.kafka.common.TopicPartition import org.apache.spark.sql.execution.streaming.{Offset, SerializedOffset} -import org.apache.spark.sql.sources.v2.reader.streaming.{Offset => OffsetV2, PartitionOffset} +import org.apache.spark.sql.sources.v2.reader.streaming.PartitionOffset /** * An [[Offset]] for the [[KafkaSource]]. This one tracks all partitions of subscribed topics and * their offsets. 
*/ private[kafka010] -case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long]) extends OffsetV2 { +case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long]) extends Offset { override val json = JsonUtils.partitionOffsets(partitionToOffsets) } diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/streaming/Offset.java b/sql/core/src/main/java/org/apache/spark/sql/execution/streaming/Offset.java index 43ad4b3..7c167dc 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/streaming/Offset.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/streaming/Offset.java @@ -18,44 +18,10 @@ package org.apache.spark.sql.execution.streaming; /** - * This is an internal, deprecated interface. New source implementations should use the - * org.apache.spark.sql.sources.v2.reader.streaming.Offset class, which is the one that will be - * supported in the long term. + * This class is an alias of {@link org.apache.spark.sql.sources.v2.reader.streaming.Offset}. It's + * internal and deprecated. New streaming data source implementations should use data source v2 API, + * which will be supported in the long term. * * This class will be removed in a future release. */ -public abstract class Offset { -/** - * A JSON-serialized representation of an Offset that is - * used for saving o
svn commit: r33930 - /release/spark/spark-2.4.2/
Author: lixiao Date: Tue May 7 07:21:45 2019 New Revision: 33930 Log: Remove Spark 2.4.2 Removed: release/spark/spark-2.4.2/ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r33928 - /dev/spark/v2.4.3-rc1-bin/ /release/spark/spark-/
Author: lixiao Date: Tue May 7 07:17:26 2019 New Revision: 33928 Log: Apache Spark 2.4.3 Added: release/spark/spark-/ - copied from r33927, dev/spark/v2.4.3-rc1-bin/ Removed: dev/spark/v2.4.3-rc1-bin/ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r33929 - in /release/spark: spark-/ spark-2.4.3/
Author: lixiao Date: Tue May 7 07:17:56 2019 New Revision: 33929 Log: Apache Spark 2.4.3 Added: release/spark/spark-2.4.3/ - copied from r33928, release/spark/spark-/ Removed: release/spark/spark-/ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r33927 - /dev/spark/v2.4.3-rc1-docs/
Author: lixiao Date: Tue May 7 07:15:54 2019 New Revision: 33927 Log: Remove RC artifacts Removed: dev/spark/v2.4.3-rc1-docs/ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark-website] branch asf-git created (now 7520ea9)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch asf-git in repository https://gitbox.apache.org/repos/asf/spark-website.git. at 7520ea9 Add docs for Apache Spark 2.4.3 This branch includes the following new commits: new 7520ea9 Add docs for Apache Spark 2.4.3 The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark-website] 01/01: Add docs for Apache Spark 2.4.3
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch asf-git in repository https://gitbox.apache.org/repos/asf/spark-website.git commit 7520ea9da64ae87051349dca99e2f26d171f625e Author: Wenchen Fan AuthorDate: Tue May 7 07:14:07 2019 + Add docs for Apache Spark 2.4.3 --- site/docs/latest | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/docs/latest b/site/docs/latest index acdc3f1..630e6ef 12 --- a/site/docs/latest +++ b/site/docs/latest @@ -1 +1 @@ -2.4.2 \ No newline at end of file +site/docs/2.4.3 \ No newline at end of file - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] tag v2.4.3 created (now c3e32bf)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to tag v2.4.3 in repository https://gitbox.apache.org/repos/asf/spark.git. at c3e32bf (commit) No new revisions were added by this update. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-27596][SQL] The JDBC 'query' option doesn't work for Oracle database
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 771da83 [SPARK-27596][SQL] The JDBC 'query' option doesn't work for Oracle database 771da83 is described below commit 771da83c34247cb1f75ee13b939ee51baa3a11bb Author: Dilip Biswal AuthorDate: Sun May 5 21:52:23 2019 -0700 [SPARK-27596][SQL] The JDBC 'query' option doesn't work for Oracle database ## What changes were proposed in this pull request? **Description from JIRA** For the JDBC option `query`, we use the identifier name to start with underscore: s"(${subquery}) _SPARK_GEN_JDBC_SUBQUERY_NAME${curId.getAndIncrement()}". This is not supported by Oracle. The Oracle doesn't seem to support identifier name to start with non-alphabet character (unless it is quoted) and has length restrictions as well. [link](https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm) In this PR, the generated alias name 'SPARK_GEN_JDBC_SUBQUERY_NAME' is fixed to remove "_" prefix and also the alias name is shortened to not exceed the identifier length limit. ## How was this patch tested? Tests are added for MySql, Postgress, Oracle and DB2 to ensure enough coverage. Closes #24532 from dilipbiswal/SPARK-27596. Authored-by: Dilip Biswal Signed-off-by: gatorsmile (cherry picked from commit 6001d476ce663ee6a458535431d30e8213181fcf) Signed-off-by: gatorsmile --- .../spark/sql/jdbc/DB2IntegrationSuite.scala | 26 .../spark/sql/jdbc/MySQLIntegrationSuite.scala | 27 + .../spark/sql/jdbc/OracleIntegrationSuite.scala| 28 ++ .../spark/sql/jdbc/PostgresIntegrationSuite.scala | 26 .../execution/datasources/jdbc/JDBCOptions.scala | 2 +- 5 files changed, 108 insertions(+), 1 deletion(-) diff --git a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala index f5930bc28..32e56f0 100644 --- a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala +++ b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala @@ -158,4 +158,30 @@ class DB2IntegrationSuite extends DockerJDBCIntegrationSuite { assert(rows(0).getInt(1) == 20) assert(rows(0).getString(2) == "1") } + + test("query JDBC option") { +val expectedResult = Set( + (42, "fred"), + (17, "dave") +).map { case (x, y) => + Row(Integer.valueOf(x), String.valueOf(y)) +} + +val query = "SELECT x, y FROM tbl WHERE x > 10" +// query option to pass on the query string. +val df = spark.read.format("jdbc") + .option("url", jdbcUrl) + .option("query", query) + .load() +assert(df.collect.toSet === expectedResult) + +// query option in the create table path. 
+sql( + s""" + |CREATE OR REPLACE TEMPORARY VIEW queryOption + |USING org.apache.spark.sql.jdbc + |OPTIONS (url '$jdbcUrl', query '$query') + """.stripMargin.replaceAll("\n", " ")) +assert(sql("select x, y from queryOption").collect.toSet == expectedResult) + } } diff --git a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala index a70ed98..9cd5c4e 100644 --- a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala +++ b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala @@ -21,6 +21,7 @@ import java.math.BigDecimal import java.sql.{Connection, Date, Timestamp} import java.util.Properties +import org.apache.spark.sql.{Row, SaveMode} import org.apache.spark.tags.DockerTest @DockerTest @@ -152,4 +153,30 @@ class MySQLIntegrationSuite extends DockerJDBCIntegrationSuite { df2.write.jdbc(jdbcUrl, "datescopy", new Properties) df3.write.jdbc(jdbcUrl, "stringscopy", new Properties) } + + test("query JDBC option") { +val expectedResult = Set( + (42, "fred"), + (17, "dave") +).map { case (x, y) => + Row(Integer.valueOf(x), String.valueOf(y)) +} + +val query = "SELECT x, y FROM tbl WHERE x > 10" +// query option to pass on t
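As context for the SPARK-27596 fix above: the JDBC `query` option wraps the user's statement in a subquery with a Spark-generated alias, and the patch drops the alias's leading underscore (and shortens it) so Oracle accepts it as an unquoted identifier. The sketch below shows typical use of the option from Scala; the connection URL, credentials, and query are placeholders, not values taken from the commit.

```scala
// Minimal usage sketch of the JDBC `query` option exercised by the tests in this commit.
// The URL, credentials, and query below are placeholders, not values from the patch.
import org.apache.spark.sql.SparkSession

object JdbcQueryOptionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-query-option")
      .master("local[*]")
      .getOrCreate()

    // Spark wraps this statement as "(query) <generated alias>"; before the fix the
    // generated alias began with "_", which Oracle rejects for unquoted identifiers.
    val query = "SELECT x, y FROM tbl WHERE x > 10"

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")  // placeholder URL
      .option("query", query)                                    // push a query instead of a table name
      .option("user", "scott")                                   // placeholder credentials
      .option("password", "tiger")
      .load()

    df.show()
    spark.stop()
  }
}
```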
[spark] branch master updated (d9bcacf -> 6001d47)
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from d9bcacf [SPARK-27629][PYSPARK] Prevent Unpickler from intervening each unpickling add 6001d47 [SPARK-27596][SQL] The JDBC 'query' option doesn't work for Oracle database No new revisions were added by this update. Summary of changes: .../spark/sql/jdbc/DB2IntegrationSuite.scala | 26 .../spark/sql/jdbc/MySQLIntegrationSuite.scala | 27 + .../spark/sql/jdbc/OracleIntegrationSuite.scala| 28 ++ .../spark/sql/jdbc/PostgresIntegrationSuite.scala | 26 .../execution/datasources/jdbc/JDBCOptions.scala | 2 +- 5 files changed, 108 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-24360][FOLLOW-UP][SQL] Add missing options for sql-migration-guide-hive-compatibility.md
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9a419c3 [SPARK-24360][FOLLOW-UP][SQL] Add missing options for sql-migration-guide-hive-compatibility.md 9a419c3 is described below commit 9a419c37ecd3e7638d35043156ac2e60dabe97df Author: Yuming Wang AuthorDate: Fri May 3 08:42:11 2019 -0700 [SPARK-24360][FOLLOW-UP][SQL] Add missing options for sql-migration-guide-hive-compatibility.md ## What changes were proposed in this pull request? This PR adds missing options for `sql-migration-guide-hive-compatibility.md`. ## How was this patch tested? N/A Closes #24520 from wangyum/SPARK-24360. Authored-by: Yuming Wang Signed-off-by: gatorsmile --- docs/sql-migration-guide-hive-compatibility.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sql-migration-guide-hive-compatibility.md b/docs/sql-migration-guide-hive-compatibility.md index 857e974..5602f53 100644 --- a/docs/sql-migration-guide-hive-compatibility.md +++ b/docs/sql-migration-guide-hive-compatibility.md @@ -25,7 +25,7 @@ license: | Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently, Hive SerDes and UDFs are based on Hive 1.2.1, and Spark SQL can be connected to different versions of Hive Metastore -(from 0.12.0 to 2.3.4. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)). +(from 0.12.0 to 2.3.4 and 3.1.0 to 3.1.1. Also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)). Deploying in Existing Hive Warehouses - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
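For readers wondering how the newly documented metastore versions are selected at runtime: Spark exposes the `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` configurations. The sketch below is an illustrative, non-authoritative example of pointing a Hive-enabled session at one of the versions added by this doc change; the application name and jar-resolution mode are assumptions.

```scala
// Illustrative sketch (assumed app name and jar resolution) of selecting one of the
// newly documented Hive metastore versions via Spark SQL configuration.
import org.apache.spark.sql.SparkSession

object HiveMetastoreVersionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-metastore-3.1.1")
      // One of the versions added to the compatibility range in this doc change.
      .config("spark.sql.hive.metastore.version", "3.1.1")
      // Resolve matching Hive client jars from Maven; a local classpath can be given instead.
      .config("spark.sql.hive.metastore.jars", "maven")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()
    spark.stop()
  }
}
```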
[spark] branch branch-2.4 updated: [SPARK-26048][SPARK-24530][2.4] Cherrypick all the missing commits to 2.4 release script
This is an automated email from the ASF dual-hosted git repository. lixiao pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new e417168 [SPARK-26048][SPARK-24530][2.4] Cherrypick all the missing commits to 2.4 release script e417168 is described below commit e417168ed012190db66a21e626b2b8d2332d6c01 Author: Wenchen Fan AuthorDate: Wed May 1 07:36:00 2019 -0700 [SPARK-26048][SPARK-24530][2.4] Cherrypick all the missing commits to 2.4 release script ## What changes were proposed in this pull request? This PR is to cherry-pick all the missing and relevant commits that were merged to master but not to branch-2.4. Previously, dbtsai used the release script in the branch 2.4 to release 2.4.1. After more investigation, I found it is risky to make a 2.4 release by using the release script in the master branch since the release script has various changes. It could easily introduce unnoticeable issues, like what we did for 2.4.2. Thus, I would cherry-pick all the missing fixes and use the updated release script to release 2.4.3 ## How was this patch tested? N/A Closes #24503 from gatorsmile/upgradeReleaseScript. Lead-authored-by: Wenchen Fan Co-authored-by: gatorsmile Co-authored-by: wright Signed-off-by: gatorsmile --- dev/create-release/do-release-docker.sh | 3 +++ dev/create-release/release-build.sh | 7 +-- dev/create-release/releaseutils.py | 2 +- dev/create-release/spark-rm/Dockerfile | 4 ++-- 4 files changed, 11 insertions(+), 5 deletions(-) diff --git a/dev/create-release/do-release-docker.sh b/dev/create-release/do-release-docker.sh index fa7b73c..c1a122e 100755 --- a/dev/create-release/do-release-docker.sh +++ b/dev/create-release/do-release-docker.sh @@ -135,6 +135,9 @@ if [ -n "$JAVA" ]; then JAVA_VOL="--volume $JAVA:/opt/spark-java" fi +# SPARK-24530: Sphinx must work with python 3 to generate doc correctly. +echo "SPHINXPYTHON=/opt/p35/bin/python" >> $ENVFILE + echo "Building $RELEASE_TAG; output will be at $WORKDIR/output" docker run -ti \ --env-file "$ENVFILE" \ diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh index 5e65d99..affb4dc 100755 --- a/dev/create-release/release-build.sh +++ b/dev/create-release/release-build.sh @@ -122,6 +122,9 @@ fi PUBLISH_SCALA_2_12=0 SCALA_2_12_PROFILES="-Pscala-2.12" +if [[ $SPARK_VERSION < "3.0." ]]; then + SCALA_2_12_PROFILES="-Pscala-2.12 -Pflume" +fi if [[ $SPARK_VERSION > "2.4" ]]; then PUBLISH_SCALA_2_12=1 fi @@ -327,7 +330,7 @@ if [[ "$1" == "package" ]]; then svn add "svn-spark/${DEST_DIR_NAME}-bin" cd svn-spark -svn ci --username $ASF_USERNAME --password "$ASF_PASSWORD" -m"Apache Spark $SPARK_PACKAGE_VERSION" +svn ci --username $ASF_USERNAME --password "$ASF_PASSWORD" -m"Apache Spark $SPARK_PACKAGE_VERSION" --no-auth-cache cd .. rm -rf svn-spark fi @@ -355,7 +358,7 @@ if [[ "$1" == "docs" ]]; then svn add "svn-spark/${DEST_DIR_NAME}-docs" cd svn-spark -svn ci --username $ASF_USERNAME --password "$ASF_PASSWORD" -m"Apache Spark $SPARK_PACKAGE_VERSION docs" +svn ci --username $ASF_USERNAME --password "$ASF_PASSWORD" -m"Apache Spark $SPARK_PACKAGE_VERSION docs" --no-auth-cache cd .. 
rm -rf svn-spark fi diff --git a/dev/create-release/releaseutils.py b/dev/create-release/releaseutils.py index f273b33..a5a26ae 100755 --- a/dev/create-release/releaseutils.py +++ b/dev/create-release/releaseutils.py @@ -236,7 +236,7 @@ def translate_component(component, commit_hash, warnings): # The returned components are already filtered and translated def find_components(commit, commit_hash): components = re.findall(r"\[\w*\]", commit.lower()) -components = [translate_component(c, commit_hash) +components = [translate_component(c, commit_hash, []) for c in components if c in known_components] return components diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile index 15f831c..4231544 100644 --- a/dev/create-release/spark-rm/Dockerfile +++ b/dev/create-release/spark-rm/Dockerfile @@ -62,8 +62,8 @@ RUN echo 'deb http://cran.cnr.Berkeley.edu/bin/linux/ubuntu xenial/' >> /etc/apt pip install $BASE_PIP_PKGS && \ pip install $PIP_PKGS && \ cd && \ - virtualenv -p python3 p35 && \ - . p35/bin/activate && \ + virtualenv -p python3 /opt/p35 && \ + . /opt/p35/bin/activate && \ pip install $BASE_PIP_PKGS && \ pip install $PIP_PKGS && \ # Install R packages and dependencies used when building. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r33851 - in /dev/spark/v2.4.3-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _site/api/java/org/apache/spark
Author: lixiao Date: Wed May 1 06:16:53 2019 New Revision: 33851 Log: Apache Spark v2.4.3-rc1 docs [This commit notification would consist of 1479 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r33850 - /dev/spark/v2.4.3-rc1-bin/
Author: lixiao Date: Wed May 1 05:57:50 2019 New Revision: 33850 Log: Apache Spark v2.4.3-rc1 Added: dev/spark/v2.4.3-rc1-bin/ dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz (with props) dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz.asc dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz.sha512 dev/spark/v2.4.3-rc1-bin/pyspark-2.4.3.tar.gz (with props) dev/spark/v2.4.3-rc1-bin/pyspark-2.4.3.tar.gz.asc dev/spark/v2.4.3-rc1-bin/pyspark-2.4.3.tar.gz.sha512 dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-hadoop2.6.tgz (with props) dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-hadoop2.6.tgz.asc dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-hadoop2.6.tgz.sha512 dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-hadoop2.7.tgz (with props) dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-hadoop2.7.tgz.asc dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-hadoop2.7.tgz.sha512 dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-without-hadoop-scala-2.12.tgz (with props) dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-without-hadoop-scala-2.12.tgz.asc dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-without-hadoop-scala-2.12.tgz.sha512 dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-without-hadoop.tgz (with props) dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-without-hadoop.tgz.asc dev/spark/v2.4.3-rc1-bin/spark-2.4.3-bin-without-hadoop.tgz.sha512 dev/spark/v2.4.3-rc1-bin/spark-2.4.3.tgz (with props) dev/spark/v2.4.3-rc1-bin/spark-2.4.3.tgz.asc dev/spark/v2.4.3-rc1-bin/spark-2.4.3.tgz.sha512 Added: dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz == Binary file - no diff available. Propchange: dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz.asc == --- dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz.asc (added) +++ dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz.asc Wed May 1 05:57:50 2019 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- +Version: GnuPG v1 + +iQIcBAABAgAGBQJcyS/nAAoJEJb3L3aDDA0beSoP/23koj6zI9FhzSqjUfYGTWaN ++QM672Wvjo0Q3P0imM9FLfNqjK13JLskulsR3slB7NlWg+7uAU1x6GgzWvrD8+qt +RH1wuqPeVLmE9MDYyb655scQWDO/f+Yq1TFfDgnXarUsvStw81HSuSqTx+tf4Gsk +D+Gdop0Z7w0rVSiJgXktl1y51FZNU5QJxi+e6whh+fVhuSZDIN/JweHY2HfwYxN1 +rbRXBC8LwPiFDlZIlnulZjpJw+07wpQPCjUIr2IwGlv1D4E6sES7YhXsWkVo0kl0 +s4npI+TkCV39QXFEqSiZtcFACohdLA5AWAsKUIHpaXahlouxHhmQEt1v2IX1tdQ3 +Yv9If3KV/NDFnf1e3E59jhHrwA5cST8ufKyShJBNb9p6qhXXqNUmBOPgYOZhEe+n +PeJdIU5fY3hz16eXePPf110pExN+csYynR4uzUAO+7TiRJNA2HdBcGoPauaMWlsf +H28uvSVeg5Zp8GSKw/vD+OBnxmfVXWPkSZAmrUsCBvWo5KTvftMTWgx7oS2vCozI +x7AgEmykUdiXOaUh6NOjMH70DovoaCTdx1HD+T1MJwAqwx05kvEhXAnwgcyXEd3B +v/Q1RTyyYfHCgR5rdyZiQRv/34O7tO5d3rL669TyKSSSLX9bg3VoBoSQLUhuKJMA +l9QcWqDnWM6Iy6WEkLkj +=mfID +-END PGP SIGNATURE- Added: dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz.sha512 == --- dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz.sha512 (added) +++ dev/spark/v2.4.3-rc1-bin/SparkR_2.4.3.tar.gz.sha512 Wed May 1 05:57:50 2019 @@ -0,0 +1,3 @@ +SparkR_2.4.3.tar.gz: 662AC057 E12A036E 4EFC7B57 F1067BE1 0618C123 1C01FF5E + 972D4E5D B58DE948 BDC01CE5 476E1553 3C4E768C F3128E47 + 1284689D D0F1202A 4A6CC3EA 823F8F87 Added: dev/spark/v2.4.3-rc1-bin/pyspark-2.4.3.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v2.4.3-rc1-bin/pyspark-2.4.3.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v2.4.3-rc1-bin/pyspark-2.4.3.tar.gz.asc == --- dev/spark/v2.4.3-rc1-bin/pyspark-2.4.3.tar.gz.asc (added) +++ dev/spark/v2.4.3-rc1-bin/pyspark-2.4.3.tar.gz.asc Wed May 1 05:57:50 2019 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- +Version: GnuPG v1 + +iQIcBAABAgAGBQJcySxrAAoJEJb3L3aDDA0bht0P/3Q9RCpLdy7Dfcyi4zCCEper +xrX6jTJT5MHGhOYHgUoJxbotWkbDjTk81xrE+I+ccDYS7TW0SeC/ngTH85O+sD/a +DVEQ0//dPhHVQNGcWaoojOtI3n80bFEE0XmWj6hoMmT7KbY9GA21iXNM2kjaaETb +GQEDAyeA8S5GSyDd9ez+OvofFo6uJfuorfvQgKF8mNRTrbhNhixGxby7WE+pdlh4 +L1oaohU+dp5ubEhDCVNAi6xXemnhWWYi1zpgxubxYSvQsLoHumBEiiLNZezddvRY +mFuH9dQQX3w5gd5znLGAzp5UPLJ341Fe/jMNAgf8TPPvJzROIw5lYmHtKwkNDCHr ++BSSOFuMejttwLGz8Kd4yJ2gxRtpzZxqzdGJ+fSOJU0mElK3e7SelW44oLEWloAB +pf0IB84eThEQvhL02JM2sl5LIl5HehZTx+EHB3ONwXMcKStoz6RSjl/IqKNjIg/K +0cEgfrdZdYy6UauP0RHMosQ83myQlQrsk04fLd+rlPmPLhAPk6ZqnP9Pe6V9X6xq +i2wn8DSiwTK9NAqGacKLEjNyVl0mTcgE/8cmdqd4e0L9gG+f3XUIPWqKYY1p0q65 +1ypOlYe+KowbV0GHvwX467RQTFjiAwMRGdcFinhbfSUk0d6iLkf29rc3JfhqrroZ +G9iULvEnsvwEmiZoK2Cs +=f/+f +-END PGP SIGNATURE- Added: dev