[spark] branch branch-3.0 updated (233dc12 -> 6f55ed4)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 233dc12  [SPARK-31290][R] Add back the deprecated R APIs
  add 6f55ed4  [SPARK-31318][SQL] Split Parquet/Avro configs for rebasing dates/timestamps in read and in write

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/avro/AvroDeserializer.scala   |  3 +-
 .../org/apache/spark/sql/avro/AvroSerializer.scala |  3 +-
 .../org/apache/spark/sql/avro/AvroSuite.scala      | 23 +++--
 .../org/apache/spark/sql/internal/SQLConf.scala    | 40 +-
 .../parquet/VectorizedColumnReader.java            |  2 +-
 .../datasources/parquet/ParquetRowConverter.scala  |  2 +-
 .../datasources/parquet/ParquetWriteSupport.scala  |  3 +-
 .../benchmark/DateTimeRebaseBenchmark.scala        |  4 +--
 .../datasources/parquet/ParquetIOSuite.scala       | 16 +
 9 files changed, 63 insertions(+), 33 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dba525c -> c5323d2)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from dba525c  [SPARK-31313][K8S][TEST] Add `m01` node name to support Minikube 1.8.x
  add c5323d2  [SPARK-31318][SQL] Split Parquet/Avro configs for rebasing dates/timestamps in read and in write

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/avro/AvroDeserializer.scala   |  3 +-
 .../org/apache/spark/sql/avro/AvroSerializer.scala |  3 +-
 .../org/apache/spark/sql/avro/AvroSuite.scala      | 23 ++-
 .../org/apache/spark/sql/internal/SQLConf.scala    | 42 +-
 .../parquet/VectorizedColumnReader.java            |  2 +-
 .../datasources/parquet/ParquetRowConverter.scala  |  2 +-
 .../datasources/parquet/ParquetWriteSupport.scala  |  3 +-
 .../benchmark/DateTimeRebaseBenchmark.scala        |  4 +--
 .../datasources/parquet/ParquetIOSuite.scala       | 16 +
 9 files changed, 65 insertions(+), 33 deletions(-)
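Background for the SPARK-31318 commits above: Spark 3.0 switched from the hybrid Julian/Gregorian calendar to the proleptic Gregorian calendar, so days-since-epoch values for ancient dates written by older versions label different physical days and must be "rebased" on read/write. A self-contained Python sketch of why the two calendars disagree — this uses the standard textbook Julian Day Number formulas, not Spark's implementation:

```python
def julian_to_jdn(y, m, d):
    """Julian-calendar date -> Julian Day Number (standard integer formula)."""
    a = (14 - m) // 12
    y, m = y + 4800 - a, m + 12 * a - 3
    return d + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def gregorian_to_jdn(y, m, d):
    """Proleptic-Gregorian date -> Julian Day Number."""
    a = (14 - m) // 12
    y, m = y + 4800 - a, m + 12 * a - 3
    return (d + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)

# The day before the Gregorian switch: Julian 1582-10-04 labels the same
# physical day as proleptic Gregorian 1582-10-14.
assert julian_to_jdn(1582, 10, 4) == gregorian_to_jdn(1582, 10, 14)

# In the year 1000 the two calendars disagree by 5 days, which is why day
# counts written under one calendar must be rebased when read under the other.
print(julian_to_jdn(1000, 1, 1) - gregorian_to_jdn(1000, 1, 1))  # 5
```

The split configs let users choose the rebase behavior independently for reads and for writes.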
[spark] branch master updated (fd0b228 -> dba525c)
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from fd0b228  [SPARK-31290][R] Add back the deprecated R APIs
  add dba525c  [SPARK-31313][K8S][TEST] Add `m01` node name to support Minikube 1.8.x

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
[spark] branch branch-2.4 updated (e226f68 -> 22e0a5a)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from e226f68  [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
  add 22e0a5a  [SPARK-31312][SQL][2.4] Cache Class instance for the UDF instance in HiveFunctionWrapper

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/hive/HiveShim.scala |  19 ++-
 .../src/test/noclasspath/TestUDTF-spark-26560.jar  | Bin 7462 -> 0 bytes
 sql/hive/src/test/noclasspath/hive-test-udfs.jar   | Bin 0 -> 35660 bytes
 .../spark/sql/hive/HiveUDFDynamicLoadSuite.scala   | 190 +
 .../spark/sql/hive/execution/SQLQuerySuite.scala   |  47 -
 5 files changed, 204 insertions(+), 52 deletions(-)
 delete mode 100644 sql/hive/src/test/noclasspath/TestUDTF-spark-26560.jar
 create mode 100644 sql/hive/src/test/noclasspath/hive-test-udfs.jar
 create mode 100644 sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveUDFDynamicLoadSuite.scala
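The SPARK-31312 change above caches the reflectively loaded `Class` instance inside `HiveFunctionWrapper`, so creating the UDF instance repeatedly does not repeat the class lookup. A toy Python model of the same resolve-once-then-reuse pattern — the `FunctionWrapper` class here is a hypothetical stand-in, not a port of the Scala code:

```python
import importlib

class FunctionWrapper:
    """Resolve a class by name at most once, then reuse the cached class
    to build fresh instances (the memoization pattern behind SPARK-31312)."""

    def __init__(self, class_name):
        self.class_name = class_name
        self._cls = None  # cached class object, resolved lazily

    def _resolve(self):
        # stand-in for reflective class loading in the JVM
        module_name, _, attr = self.class_name.rpartition(".")
        return getattr(importlib.import_module(module_name), attr)

    def create(self):
        if self._cls is None:  # load at most once
            self._cls = self._resolve()
        return self._cls()

w = FunctionWrapper("collections.OrderedDict")
first, second = w.create(), w.create()
assert type(first) is type(second)  # same cached class, fresh instances
```

The design point mirrors the commit: instance creation stays cheap because the expensive resolution step is paid only on first use.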
[spark] branch branch-3.0 updated: [SPARK-31290][R] Add back the deprecated R APIs
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 233dc12  [SPARK-31290][R] Add back the deprecated R APIs

233dc12 is described below

commit 233dc1260af6df7a8e9a689ba5c6fe3e81a5bc1f
Author: Huaxin Gao
AuthorDate: Wed Apr 1 10:38:03 2020 +0900

    [SPARK-31290][R] Add back the deprecated R APIs

    ### What changes were proposed in this pull request?

    Add back the deprecated R APIs removed by https://github.com/apache/spark/pull/22843 and https://github.com/apache/spark/pull/22815. These APIs are:

    - `sparkR.init`
    - `sparkRSQL.init`
    - `sparkRHive.init`
    - `registerTempTable`
    - `createExternalTable`
    - `dropTempTable`

    There is no need to port a function such as

    ```r
    createExternalTable <- function(x, ...) {
      dispatchFunc("createExternalTable(tableName, path = NULL, source = NULL, ...)", x, ...)
    }
    ```

    because it existed only for backward compatibility with the old SQLContext (introduced in https://github.com/apache/spark/pull/9192), and it is no longer needed since SparkR replaced SQLContext with SparkSession in https://github.com/apache/spark/pull/13635.

    ### Why are the changes needed?

    To follow Spark's amended semantic versioning policy.

    ### Does this PR introduce any user-facing change?

    Yes. The removed R APIs are put back.

    ### How was this patch tested?

    Added back the removed tests.

    Closes #28058 from huaxingao/r.

Authored-by: Huaxin Gao
Signed-off-by: HyukjinKwon
(cherry picked from commit fd0b2281272daba590c6bb277688087d0b26053f)
Signed-off-by: HyukjinKwon
---
 R/pkg/NAMESPACE                       |  7 +++
 R/pkg/R/DataFrame.R                   | 26 ++
 R/pkg/R/catalog.R                     | 54 +++
 R/pkg/R/generics.R                    |  3 ++
 R/pkg/R/sparkR.R                      | 98 +++
 R/pkg/tests/fulltests/test_sparkSQL.R | 13 -
 docs/sparkr-migration-guide.md        |  3 +-
 7 files changed, 200 insertions(+), 4 deletions(-)

diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 7ed2e36..9fd7bb4 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -28,6 +28,7 @@ importFrom("utils", "download.file", "object.size", "packageVersion", "tail", "u
 # S3 methods exported
 export("sparkR.session")
+export("sparkR.init")
 export("sparkR.session.stop")
 export("sparkR.stop")
 export("sparkR.conf")
@@ -41,6 +42,9 @@ export("sparkR.callJStatic")

 export("install.spark")

+export("sparkRSQL.init",
+       "sparkRHive.init")
+
 # MLlib integration
 exportMethods("glm",
               "spark.glm",
@@ -148,6 +152,7 @@ exportMethods("arrange",
               "printSchema",
               "randomSplit",
               "rbind",
+              "registerTempTable",
               "rename",
               "repartition",
               "repartitionByRange",
@@ -420,8 +425,10 @@ export("as.DataFrame",
        "cacheTable",
        "clearCache",
        "createDataFrame",
+       "createExternalTable",
        "createTable",
        "currentDatabase",
+       "dropTempTable",
        "dropTempView",
        "listColumns",
        "listDatabases",
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 593d3ca..14d2076 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -521,6 +521,32 @@ setMethod("createOrReplaceTempView",
             invisible(callJMethod(x@sdf, "createOrReplaceTempView", viewName))
           })

+#' (Deprecated) Register Temporary Table
+#'
+#' Registers a SparkDataFrame as a Temporary Table in the SparkSession
+#' @param x A SparkDataFrame
+#' @param tableName A character vector containing the name of the table
+#'
+#' @seealso \link{createOrReplaceTempView}
+#' @rdname registerTempTable-deprecated
+#' @name registerTempTable
+#' @aliases registerTempTable,SparkDataFrame,character-method
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.json"
+#' df <- read.json(path)
+#' registerTempTable(df, "json_df")
+#' new_df <- sql("SELECT * FROM json_df")
+#'}
+#' @note registerTempTable since 1.4.0
+setMethod("registerTempTable",
+          signature(x = "SparkDataFrame", tableName = "character"),
+          function(x, tableName) {
+            .Deprecated("createOrReplaceTempView")
+            invisible(callJMethod(x@sdf, "createOrReplaceTempView", tableName))
+          })
+
 #' insertInto
 #'
 #' Insert the contents of a SparkDataFrame into a table registered in the current SparkSession.
diff --git a/R/pkg/R/catalog.R b/R/pkg/R/catalog.R
index 7641f8a..275737f 100644
--- a/R/pkg/R/catalog.R
+++ b/R/pkg/R/catalog.R
@@ -17,6 +17,35 @@

 # catalog.R: SparkSession catalog functions

+#' (Deprecated) Create an external table
+#'
+#'
[spark] branch master updated: [SPARK-31290][R] Add back the deprecated R APIs
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new fd0b228  [SPARK-31290][R] Add back the deprecated R APIs

fd0b228 is described below

commit fd0b2281272daba590c6bb277688087d0b26053f
Author: Huaxin Gao
AuthorDate: Wed Apr 1 10:38:03 2020 +0900

    [SPARK-31290][R] Add back the deprecated R APIs

    ### What changes were proposed in this pull request?

    Add back the deprecated R APIs removed by https://github.com/apache/spark/pull/22843 and https://github.com/apache/spark/pull/22815. These APIs are:

    - `sparkR.init`
    - `sparkRSQL.init`
    - `sparkRHive.init`
    - `registerTempTable`
    - `createExternalTable`
    - `dropTempTable`

    There is no need to port a function such as

    ```r
    createExternalTable <- function(x, ...) {
      dispatchFunc("createExternalTable(tableName, path = NULL, source = NULL, ...)", x, ...)
    }
    ```

    because it existed only for backward compatibility with the old SQLContext (introduced in https://github.com/apache/spark/pull/9192), and it is no longer needed since SparkR replaced SQLContext with SparkSession in https://github.com/apache/spark/pull/13635.

    ### Why are the changes needed?

    To follow Spark's amended semantic versioning policy.

    ### Does this PR introduce any user-facing change?

    Yes. The removed R APIs are put back.

    ### How was this patch tested?

    Added back the removed tests.

    Closes #28058 from huaxingao/r.

Authored-by: Huaxin Gao
Signed-off-by: HyukjinKwon
---
 R/pkg/NAMESPACE                       |  7 +++
 R/pkg/R/DataFrame.R                   | 26 ++
 R/pkg/R/catalog.R                     | 54 +++
 R/pkg/R/generics.R                    |  3 ++
 R/pkg/R/sparkR.R                      | 98 +++
 R/pkg/tests/fulltests/test_sparkSQL.R | 13 -
 docs/sparkr-migration-guide.md        |  3 +-
 7 files changed, 200 insertions(+), 4 deletions(-)

diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 56eceb8..fb879e4 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -28,6 +28,7 @@ importFrom("utils", "download.file", "object.size", "packageVersion", "tail", "u
 # S3 methods exported
 export("sparkR.session")
+export("sparkR.init")
 export("sparkR.session.stop")
 export("sparkR.stop")
 export("sparkR.conf")
@@ -41,6 +42,9 @@ export("sparkR.callJStatic")

 export("install.spark")

+export("sparkRSQL.init",
+       "sparkRHive.init")
+
 # MLlib integration
 exportMethods("glm",
               "spark.glm",
@@ -148,6 +152,7 @@ exportMethods("arrange",
               "printSchema",
               "randomSplit",
               "rbind",
+              "registerTempTable",
               "rename",
               "repartition",
               "repartitionByRange",
@@ -431,8 +436,10 @@ export("as.DataFrame",
        "cacheTable",
        "clearCache",
        "createDataFrame",
+       "createExternalTable",
        "createTable",
        "currentDatabase",
+       "dropTempTable",
        "dropTempView",
        "listColumns",
        "listDatabases",
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 593d3ca..14d2076 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -521,6 +521,32 @@ setMethod("createOrReplaceTempView",
             invisible(callJMethod(x@sdf, "createOrReplaceTempView", viewName))
           })

+#' (Deprecated) Register Temporary Table
+#'
+#' Registers a SparkDataFrame as a Temporary Table in the SparkSession
+#' @param x A SparkDataFrame
+#' @param tableName A character vector containing the name of the table
+#'
+#' @seealso \link{createOrReplaceTempView}
+#' @rdname registerTempTable-deprecated
+#' @name registerTempTable
+#' @aliases registerTempTable,SparkDataFrame,character-method
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.json"
+#' df <- read.json(path)
+#' registerTempTable(df, "json_df")
+#' new_df <- sql("SELECT * FROM json_df")
+#'}
+#' @note registerTempTable since 1.4.0
+setMethod("registerTempTable",
+          signature(x = "SparkDataFrame", tableName = "character"),
+          function(x, tableName) {
+            .Deprecated("createOrReplaceTempView")
+            invisible(callJMethod(x@sdf, "createOrReplaceTempView", tableName))
+          })
+
 #' insertInto
 #'
 #' Insert the contents of a SparkDataFrame into a table registered in the current SparkSession.
diff --git a/R/pkg/R/catalog.R b/R/pkg/R/catalog.R
index 7641f8a..275737f 100644
--- a/R/pkg/R/catalog.R
+++ b/R/pkg/R/catalog.R
@@ -17,6 +17,35 @@

 # catalog.R: SparkSession catalog functions

+#' (Deprecated) Create an external table
+#'
+#' Creates an external table based on the dataset in a data source,
+#' Returns a SparkDataFrame associated with the
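The restored `registerTempTable` above follows a standard deprecate-and-delegate pattern: emit a deprecation warning (`.Deprecated` in R), then call the replacement API. A minimal Python model of the same pattern — the `DataFrame` class here is a toy stand-in, not PySpark's:

```python
import warnings

class DataFrame:
    """Toy stand-in for a SparkDataFrame; only models temp-view registration."""

    def __init__(self):
        self._views = {}

    def createOrReplaceTempView(self, name):
        # the replacement API: register (or overwrite) the named view
        self._views[name] = self

    def registerTempTable(self, name):
        # deprecated alias kept for backward compatibility:
        # warn once, then delegate to the replacement API
        warnings.warn(
            "registerTempTable is deprecated; use createOrReplaceTempView",
            DeprecationWarning,
            stacklevel=2,
        )
        self.createOrReplaceTempView(name)
```

Callers of the old name keep working, while the warning nudges them toward the new API before the alias is eventually removed.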
[spark] branch master updated: [SPARK-31308][PYSPARK] Merging pyFiles to files argument for Non-PySpark applications
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 20fc6fa  [SPARK-31308][PYSPARK] Merging pyFiles to files argument for Non-PySpark applications

20fc6fa is described below

commit 20fc6fa8398b9dc47b9ae7df52133a306f89b25f
Author: Liang-Chi Hsieh
AuthorDate: Tue Mar 31 18:08:55 2020 -0700

    [SPARK-31308][PYSPARK] Merging pyFiles to files argument for Non-PySpark applications

    ### What changes were proposed in this pull request?

    This PR (SPARK-31308) proposes to add Python dependencies even when the application is not a Python application.

    ### Why are the changes needed?

    Currently, SparkSubmit merges the `pyFiles` argument into the `files` argument only for Python applications. As noted in #21420, "for some Spark applications, though they're a java program, they require not only jar dependencies, but also python dependencies", so we need to add `pyFiles` to `files` even when the application is not a Python application.

    ### Does this PR introduce any user-facing change?

    Yes. After this change, for non-PySpark applications, the Python files specified by `pyFiles` are also added to `files`, as for PySpark applications.

    ### How was this patch tested?

    Manually tested in a Jupyter notebook and with `spark-submit --verbose`:

    ```
    Spark config:
    ...
    (spark.files,file:/Users/dongjoon/PRS/SPARK-PR-28077/a.py)
    (spark.submit.deployMode,client)
    (spark.master,local[*])
    ```

    Closes #28077 from viirya/pyfile.

Lead-authored-by: Liang-Chi Hsieh
Co-authored-by: Liang-Chi Hsieh
Signed-off-by: Dongjoon Hyun
---
 core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
index 4d67dfa..1271a3d 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -474,10 +474,12 @@ private[spark] class SparkSubmit extends Logging {
       args.mainClass = "org.apache.spark.deploy.PythonRunner"
       args.childArgs = ArrayBuffer(localPrimaryResource, localPyFiles) ++ args.childArgs
     }
-    if (clusterManager != YARN) {
-      // The YARN backend handles python files differently, so don't merge the lists.
-      args.files = mergeFileLists(args.files, args.pyFiles)
-    }
+    }
+
+    // Non-PySpark applications can need Python dependencies.
+    if (deployMode == CLIENT && clusterManager != YARN) {
+      // The YARN backend handles python files differently, so don't merge the lists.
+      args.files = mergeFileLists(args.files, args.pyFiles)
     }

     if (localPyFiles != null) {
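From the way it is called in the diff above, `mergeFileLists` joins comma-separated file lists while skipping empty entries. A rough Python sketch of that behavior — an assumption based on how the arguments are used here, not a port of the Scala helper:

```python
def merge_file_lists(*lists):
    """Merge comma-separated file lists, dropping None/empty entries
    (a sketch of how SparkSubmit combines `files` and `pyFiles`)."""
    non_empty = [entry for entry in lists if entry]
    return ",".join(non_empty) if non_empty else None

files = "file:/tmp/app.jar"
py_files = "file:/tmp/a.py,file:/tmp/b.py"
print(merge_file_lists(files, py_files))
# file:/tmp/app.jar,file:/tmp/a.py,file:/tmp/b.py
```

With the commit applied, the Python list is folded into `spark.files` for client-mode non-YARN submissions regardless of whether the main resource is a Python application.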
[spark] branch master updated (1a7f964 -> 5ec1814)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 1a7f964  [SPARK-31305][SQL][DOCS] Add a page to list all commands in SQL Reference
  add 5ec1814  [SPARK-31248][CORE][TEST] Fix flaky ExecutorAllocationManagerSuite.interleaving add and remove

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala  | 2 +-
 .../test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala  | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-31305][SQL][DOCS] Add a page to list all commands in SQL Reference
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 01b26c4  [SPARK-31305][SQL][DOCS] Add a page to list all commands in SQL Reference

01b26c4 is described below

commit 01b26c49009d8136f1f962e87ce7e35db43533ab
Author: Huaxin Gao
AuthorDate: Wed Apr 1 08:42:15 2020 +0900

    [SPARK-31305][SQL][DOCS] Add a page to list all commands in SQL Reference

    ### What changes were proposed in this pull request?

    Add a page to list all commands in SQL Reference.

    ### Why are the changes needed?

    So it is easier for users to find a specific command.

    ### Does this PR introduce any user-facing change?

    Before:
    ![image](https://user-images.githubusercontent.com/13592258/77938658-ec03e700-726a-11ea-983c-7a559cc0aae2.png)

    After:
    ![image](https://user-images.githubusercontent.com/13592258/77937899-d3df9800-7269-11ea-85db-749a9521576a.png)
    ![image](https://user-images.githubusercontent.com/13592258/77937924-db9f3c80-7269-11ea-9441-7603feee421c.png)

    Also move `USE DATABASE` from the query category to the DDL category.

    ### How was this patch tested?

    Manually built and checked.

    Closes #28074 from huaxingao/list-all.

Authored-by: Huaxin Gao
Signed-off-by: Takeshi Yamamuro
(cherry picked from commit 1a7f9649b67d2108cb14e9e466855dfe52db6d66)
Signed-off-by: Takeshi Yamamuro
---
 docs/_data/menu-sql.yaml   |  4 +--
 docs/sql-ref-syntax-ddl.md |  1 +
 docs/sql-ref-syntax.md     | 62 +-
 3 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
index 3bf4952..6534c50 100644
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@@ -123,6 +123,8 @@
       url: sql-ref-syntax-ddl-truncate-table.html
     - text: REPAIR TABLE
       url: sql-ref-syntax-ddl-repair-table.html
+    - text: USE DATABASE
+      url: sql-ref-syntax-qry-select-usedb.html
 - text: Data Manipulation Statements
   url: sql-ref-syntax-dml.html
   subitems:
@@ -152,8 +154,6 @@
       url: sql-ref-syntax-qry-select-distribute-by.html
     - text: LIMIT Clause
       url: sql-ref-syntax-qry-select-limit.html
-    - text: USE database
-      url: sql-ref-syntax-qry-select-usedb.html
 - text: EXPLAIN
   url: sql-ref-syntax-qry-explain.html
 - text: Auxiliary Statements
diff --git a/docs/sql-ref-syntax-ddl.md b/docs/sql-ref-syntax-ddl.md
index 954020a..ab4e95a 100644
--- a/docs/sql-ref-syntax-ddl.md
+++ b/docs/sql-ref-syntax-ddl.md
@@ -36,3 +36,4 @@ Data Definition Statements are used to create or modify the structure of databas
 - [DROP VIEW](sql-ref-syntax-ddl-drop-view.html)
 - [TRUNCATE TABLE](sql-ref-syntax-ddl-truncate-table.html)
 - [REPAIR TABLE](sql-ref-syntax-ddl-repair-table.html)
+- [USE DATABASE](sql-ref-syntax-qry-select-usedb.html)
diff --git a/docs/sql-ref-syntax.md b/docs/sql-ref-syntax.md
index 2510278..3db97ac 100644
--- a/docs/sql-ref-syntax.md
+++ b/docs/sql-ref-syntax.md
@@ -19,4 +19,64 @@ license: |
   limitations under the License.
 ---

-Spark SQL is Apache Spark's module for working with structured data. The SQL Syntax section describes the SQL syntax in detail along with usage examples when applicable.
+Spark SQL is Apache Spark's module for working with structured data. The SQL Syntax section describes the SQL syntax in detail along with usage examples when applicable. This document provides a list of Data Definition and Data Manipulation Statements, as well as Data Retrieval and Auxiliary Statements.
+
+### DDL Statements
+- [ALTER DATABASE](sql-ref-syntax-ddl-alter-database.html)
+- [ALTER TABLE](sql-ref-syntax-ddl-alter-table.html)
+- [ALTER VIEW](sql-ref-syntax-ddl-alter-view.html)
+- [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html)
+- [CREATE FUNCTION](sql-ref-syntax-ddl-create-function.html)
+- [CREATE TABLE](sql-ref-syntax-ddl-create-table.html)
+- [CREATE VIEW](sql-ref-syntax-ddl-create-view.html)
+- [DROP DATABASE](sql-ref-syntax-ddl-drop-database.html)
+- [DROP FUNCTION](sql-ref-syntax-ddl-drop-function.html)
+- [DROP TABLE](sql-ref-syntax-ddl-drop-table.html)
+- [DROP VIEW](sql-ref-syntax-ddl-drop-view.html)
+- [REPAIR TABLE](sql-ref-syntax-ddl-repair-table.html)
+- [TRUNCATE TABLE](sql-ref-syntax-ddl-truncate-table.html)
+- [USE DATABASE](sql-ref-syntax-qry-select-usedb.html)
+
+### DML Statements
+- [INSERT INTO](sql-ref-syntax-dml-insert-into.html)
+- [INSERT OVERWRITE](sql-ref-syntax-dml-insert-overwrite-table.html)
+- [INSERT OVERWRITE DIRECTORY](sql-ref-syntax-dml-insert-overwrite-directory.html)
+- [INSERT OVERWRITE DIRECTORY with
[spark] branch master updated: [SPARK-31305][SQL][DOCS] Add a page to list all commands in SQL Reference
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1a7f964  [SPARK-31305][SQL][DOCS] Add a page to list all commands in SQL Reference

1a7f964 is described below

commit 1a7f9649b67d2108cb14e9e466855dfe52db6d66
Author: Huaxin Gao
AuthorDate: Wed Apr 1 08:42:15 2020 +0900

    [SPARK-31305][SQL][DOCS] Add a page to list all commands in SQL Reference

    ### What changes were proposed in this pull request?

    Add a page to list all commands in SQL Reference.

    ### Why are the changes needed?

    So it is easier for users to find a specific command.

    ### Does this PR introduce any user-facing change?

    Before:
    ![image](https://user-images.githubusercontent.com/13592258/77938658-ec03e700-726a-11ea-983c-7a559cc0aae2.png)

    After:
    ![image](https://user-images.githubusercontent.com/13592258/77937899-d3df9800-7269-11ea-85db-749a9521576a.png)
    ![image](https://user-images.githubusercontent.com/13592258/77937924-db9f3c80-7269-11ea-9441-7603feee421c.png)

    Also move `USE DATABASE` from the query category to the DDL category.

    ### How was this patch tested?

    Manually built and checked.

    Closes #28074 from huaxingao/list-all.

Authored-by: Huaxin Gao
Signed-off-by: Takeshi Yamamuro
---
 docs/_data/menu-sql.yaml   |  4 +--
 docs/sql-ref-syntax-ddl.md |  1 +
 docs/sql-ref-syntax.md     | 62 +-
 3 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
index 3bf4952..6534c50 100644
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@@ -123,6 +123,8 @@
       url: sql-ref-syntax-ddl-truncate-table.html
     - text: REPAIR TABLE
       url: sql-ref-syntax-ddl-repair-table.html
+    - text: USE DATABASE
+      url: sql-ref-syntax-qry-select-usedb.html
 - text: Data Manipulation Statements
   url: sql-ref-syntax-dml.html
   subitems:
@@ -152,8 +154,6 @@
       url: sql-ref-syntax-qry-select-distribute-by.html
     - text: LIMIT Clause
       url: sql-ref-syntax-qry-select-limit.html
-    - text: USE database
-      url: sql-ref-syntax-qry-select-usedb.html
 - text: EXPLAIN
   url: sql-ref-syntax-qry-explain.html
 - text: Auxiliary Statements
diff --git a/docs/sql-ref-syntax-ddl.md b/docs/sql-ref-syntax-ddl.md
index 954020a..ab4e95a 100644
--- a/docs/sql-ref-syntax-ddl.md
+++ b/docs/sql-ref-syntax-ddl.md
@@ -36,3 +36,4 @@ Data Definition Statements are used to create or modify the structure of databas
 - [DROP VIEW](sql-ref-syntax-ddl-drop-view.html)
 - [TRUNCATE TABLE](sql-ref-syntax-ddl-truncate-table.html)
 - [REPAIR TABLE](sql-ref-syntax-ddl-repair-table.html)
+- [USE DATABASE](sql-ref-syntax-qry-select-usedb.html)
diff --git a/docs/sql-ref-syntax.md b/docs/sql-ref-syntax.md
index 2510278..3db97ac 100644
--- a/docs/sql-ref-syntax.md
+++ b/docs/sql-ref-syntax.md
@@ -19,4 +19,64 @@ license: |
   limitations under the License.
 ---

-Spark SQL is Apache Spark's module for working with structured data. The SQL Syntax section describes the SQL syntax in detail along with usage examples when applicable.
+Spark SQL is Apache Spark's module for working with structured data. The SQL Syntax section describes the SQL syntax in detail along with usage examples when applicable. This document provides a list of Data Definition and Data Manipulation Statements, as well as Data Retrieval and Auxiliary Statements.
+
+### DDL Statements
+- [ALTER DATABASE](sql-ref-syntax-ddl-alter-database.html)
+- [ALTER TABLE](sql-ref-syntax-ddl-alter-table.html)
+- [ALTER VIEW](sql-ref-syntax-ddl-alter-view.html)
+- [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html)
+- [CREATE FUNCTION](sql-ref-syntax-ddl-create-function.html)
+- [CREATE TABLE](sql-ref-syntax-ddl-create-table.html)
+- [CREATE VIEW](sql-ref-syntax-ddl-create-view.html)
+- [DROP DATABASE](sql-ref-syntax-ddl-drop-database.html)
+- [DROP FUNCTION](sql-ref-syntax-ddl-drop-function.html)
+- [DROP TABLE](sql-ref-syntax-ddl-drop-table.html)
+- [DROP VIEW](sql-ref-syntax-ddl-drop-view.html)
+- [REPAIR TABLE](sql-ref-syntax-ddl-repair-table.html)
+- [TRUNCATE TABLE](sql-ref-syntax-ddl-truncate-table.html)
+- [USE DATABASE](sql-ref-syntax-qry-select-usedb.html)
+
+### DML Statements
+- [INSERT INTO](sql-ref-syntax-dml-insert-into.html)
+- [INSERT OVERWRITE](sql-ref-syntax-dml-insert-overwrite-table.html)
+- [INSERT OVERWRITE DIRECTORY](sql-ref-syntax-dml-insert-overwrite-directory.html)
+- [INSERT OVERWRITE DIRECTORY with Hive format](sql-ref-syntax-dml-insert-overwrite-directory-hive.html)
+- [LOAD](sql-ref-syntax-dml-load.html)
+
+###
[spark] branch master updated: [SPARK-31304][ML][EXAMPLES] Add examples for ml.stat.ANOVATest
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e65c21e  [SPARK-31304][ML][EXAMPLES] Add examples for ml.stat.ANOVATest

e65c21e is described below

commit e65c21e093a643573f7ced4998dd9050557ec328
Author: Qianyang Yu
AuthorDate: Tue Mar 31 16:33:26 2020 -0500

    [SPARK-31304][ML][EXAMPLES] Add examples for ml.stat.ANOVATest

    ### What changes were proposed in this pull request?

    Add an ANOVATest example for ml.stat.ANOVATest in Python, Java, and Scala.

    ### Why are the changes needed?

    Improve the ML examples.

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    Manually ran the examples.

    Closes #28073 from kevinyu98/add-ANOVA-example.

Authored-by: Qianyang Yu
Signed-off-by: Sean Owen
---
 .../spark/examples/ml/JavaANOVATestExample.java    | 75 ++
 examples/src/main/python/ml/anova_test_example.py  | 52 +++
 .../spark/examples/ml/ANOVATestExample.scala       | 63 ++
 3 files changed, 190 insertions(+)

diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java
new file mode 100644
index 000..3b2de1f
--- /dev/null
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.stat.ANOVATest;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.*;
+// $example off$
+
+/**
+ * An example for ANOVA testing.
+ * Run with
+ * <pre>
+ * bin/run-example ml.JavaANOVATestExample
+ * </pre>
+ */
+public class JavaANOVATestExample {
+
+  public static void main(String[] args) {
+    SparkSession spark = SparkSession
+      .builder()
+      .appName("JavaANOVATestExample")
+      .getOrCreate();
+
+    // $example on$
+    List<Row> data = Arrays.asList(
+      RowFactory.create(3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)),
+      RowFactory.create(2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)),
+      RowFactory.create(1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)),
+      RowFactory.create(2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)),
+      RowFactory.create(4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)),
+      RowFactory.create(4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1))
+    );
+
+    StructType schema = new StructType(new StructField[]{
+      new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
+      new StructField("features", new VectorUDT(), false, Metadata.empty()),
+    });
+
+    Dataset<Row> df = spark.createDataFrame(data, schema);
+    Row r = ANOVATest.test(df, "features", "label").head();
+    System.out.println("pValues: " + r.get(0).toString());
+    System.out.println("degreesOfFreedom: " + r.getList(1).toString());
+    System.out.println("fValues: " + r.get(2).toString());
+
+    // $example off$
+
+    spark.stop();
+  }
+}
diff --git a/examples/src/main/python/ml/anova_test_example.py b/examples/src/main/python/ml/anova_test_example.py
new file mode 100644
index 000..3fffdbd
--- /dev/null
+++ b/examples/src/main/python/ml/anova_test_example.py
@@ -0,0 +1,52 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#
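Under the hood, `ANOVATest.test` performs a one-way ANOVA per feature, returning p-values, degrees of freedom, and F-values (the three columns printed in the example above). A pure-Python sketch of the F-statistic itself — an illustration of the math, not Spark's implementation:

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of sample groups."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of samples
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group sum of squares: each group mean vs. the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: each sample vs. its own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k  # degrees of freedom
    return (ss_between / df_between) / (ss_within / df_within)

# Two groups with clearly different means give a large F value.
print(anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))  # 13.5
```

A large F (relative to the F-distribution with those degrees of freedom) indicates the feature's mean differs significantly across label groups, which is what makes the test useful for feature selection.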
[spark] branch master updated (590b9a0 -> 34c7ec8)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 590b9a0  [SPARK-31010][SQL][FOLLOW-UP] Add Java UDF suggestion in error message of untyped Scala UDF
  add 34c7ec8  [SPARK-31253][SQL] Add metrics to AQE shuffle reader

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/ShuffledRowRDD.scala       |  16 ++-
 .../adaptive/CoalesceShufflePartitions.scala       |   6 +-
 .../adaptive/CustomShuffleReaderExec.scala         | 114 ++---
 .../adaptive/OptimizeLocalShuffleReader.scala      |  15 ++-
 .../execution/adaptive/OptimizeSkewedJoin.scala    |  82 ---
 .../sql/execution/adaptive/QueryStageExec.scala    |   5 +
 .../execution/CoalesceShufflePartitionsSuite.scala |  23 +++--
 .../adaptive/AdaptiveQueryExecSuite.scala          |  74 -
 8 files changed, 229 insertions(+), 106 deletions(-)
[spark] branch master updated (2a6aa8e -> 590b9a0)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 2a6aa8e [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper add 590b9a0 [SPARK-31010][SQL][FOLLOW-UP] Add Java UDF suggestion in error message of untyped Scala UDF No new revisions were added by this update. Summary of changes: sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31010][SQL][FOLLOW-UP] Add Java UDF suggestion in error message of untyped Scala UDF
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 207344d [SPARK-31010][SQL][FOLLOW-UP] Add Java UDF suggestion in error message of untyped Scala UDF 207344d is described below commit 207344d0da86496b377c2c5f5ad613c6d02f4c33 Author: yi.wu AuthorDate: Tue Mar 31 17:35:26 2020 + [SPARK-31010][SQL][FOLLOW-UP] Add Java UDF suggestion in error message of untyped Scala UDF ### What changes were proposed in this pull request? Added Java UDF suggestion in the in error message of untyped Scala UDF. ### Why are the changes needed? To help user migrate their use case from deprecate untyped Scala UDF to other supported UDF. ### Does this PR introduce any user-facing change? No. It haven't been released. ### How was this patch tested? Pass Jenkins. Closes #28070 from Ngone51/spark_31010. Authored-by: yi.wu Signed-off-by: Wenchen Fan (cherry picked from commit 590b9a0132b68d9523e663997def957b2e46dfb1) Signed-off-by: Wenchen Fan --- sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index fd4e77f..782be98 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -4841,9 +4841,13 @@ object functions { "information. Spark may blindly pass null to the Scala closure with primitive-type " + "argument, and the closure will see the default value of the Java type for the null " + "argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. " + -"You could use typed Scala UDF APIs (e.g. 
`udf((x: Int) => x)`) to avoid this problem, " + -s"or set ${SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF.key} to true and use this API with " + -s"caution." +"To get rid of this error, you could:\n" + +"1. use typed Scala UDF APIs, e.g. `udf((x: Int) => x)`\n" + +"2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { " + +"override def call(s: String): Integer = s.length() }, IntegerType)`, " + +"if input types are all non primitive\n" + +s"3. set ${SQLConf.LEGACY_ALLOW_UNTYPED_SCALA_UDF.key} to true and " + +s"use this API with caution" throw new AnalysisException(errorMsg) } SparkUserDefinedFunction(f, dataType, inputEncoders = Nil) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-30775][DOC] Improve the description of executor metrics in the monitoring documentation
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new ca3887a [SPARK-30775][DOC] Improve the description of executor metrics in the monitoring documentation ca3887a is described below commit ca3887a0de31fa78097ca7ee92ead914a3ce050c Author: Luca Canali AuthorDate: Mon Mar 30 18:00:54 2020 -0700 [SPARK-30775][DOC] Improve the description of executor metrics in the monitoring documentation ### What changes were proposed in this pull request? This PR (SPARK-30775) aims to improve the description of the executor metrics in the monitoring documentation. ### Why are the changes needed? Improve and clarify monitoring documentation by: - adding reference to the Prometheus end point, as implemented in [SPARK-29064] - extending the list and descripion of executor metrics, following up from [SPARK-27157] ### Does this PR introduce any user-facing change? Documentation update. ### How was this patch tested? n.a. Closes #27526 from LucaCanali/docPrometheusMetricsFollowupSpark29064. Authored-by: Luca Canali Signed-off-by: Dongjoon Hyun (cherry picked from commit aa98ac52dbbe3fc2d3b152af9324a71f48439a38) Signed-off-by: Dongjoon Hyun --- docs/monitoring.md | 58 +++--- 1 file changed, 51 insertions(+), 7 deletions(-) diff --git a/docs/monitoring.md b/docs/monitoring.md index ba3f1dc..131cd2a 100644 --- a/docs/monitoring.md +++ b/docs/monitoring.md @@ -689,9 +689,12 @@ A list of the available metrics, with a short description: ### Executor Metrics Executor-level metrics are sent from each executor to the driver as part of the Heartbeat to describe the performance metrics of Executor itself like JVM heap memory, GC information. -Executor metric values and their measured peak values per executor are exposed via the REST API at the end point `/applications/[app-id]/executors`. 
-In addition, aggregated per-stage peak values of the executor metrics are written to the event log if `spark.eventLog.logStageExecutorMetrics` is true. -Executor metrics are also exposed via the Spark metrics system based on the Dropwizard metrics library. +Executor metric values and their measured memory peak values per executor are exposed via the REST API in JSON format and in Prometheus format. +The JSON end point is exposed at: `/applications/[app-id]/executors`, and the Prometheus endpoint at: `/metrics/executors/prometheus`. +The Prometheus endpoint is conditional to a configuration parameter: `spark.ui.prometheus.enabled=true` (the default is `false`). +In addition, aggregated per-stage peak values of the executor memory metrics are written to the event log if +`spark.eventLog.logStageExecutorMetrics` is true. +Executor memory metrics are also exposed via the Spark metrics system based on the Dropwizard metrics library. A list of the available metrics, with a short description: @@ -699,21 +702,62 @@ A list of the available metrics, with a short description: Short description +rddBlocks +RDD blocks in the block manager of this executor. + + +memoryUsed +Storage memory used by this executor. + + +diskUsed +Disk space used for RDD storage by this executor. + + +totalCores +Number of cores available in this executor. + + +maxTasks +Maximum number of tasks that can run concurrently in this executor. + + +activeTasks +Number of tasks currently executing. + + +failedTasks +Number of tasks that have failed in this executor. + + +completedTasks +Number of tasks that have completed in this executor. + + +totalTasks +Total number of tasks (running, failed and completed) in this executor. + + +totalDuration +Elapsed time the JVM spent executing tasks in this executor. +The value is expressed in milliseconds. + + totalGCTime -Elapsed time the JVM spent in garbage collection summed in this Executor. 
+Elapsed time the JVM spent in garbage collection summed in this executor. The value is expressed in milliseconds. totalInputBytes -Total input bytes summed in this Executor. +Total input bytes summed in this executor. totalShuffleRead -Total shuffer read bytes summed in this Executor. +Total shuffle read bytes summed in this executor. totalShuffleWrite -Total shuffer write bytes summed in this Executor. +Total shuffle write bytes summed in this executor. maxMemory - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail:
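The monitoring.md change above names two endpoints for the executor memory metrics. As a minimal sketch of how to reach them — the host, port, and application id below are placeholder assumptions, not part of the commit:

```shell
# spark-defaults.conf -- the Prometheus endpoint is gated on this flag
# (default false, per the doc change above):
#   spark.ui.prometheus.enabled  true

# JSON endpoint (Spark REST API; the app id is a placeholder):
curl http://localhost:4040/api/v1/applications/app-20200331-0001/executors

# Prometheus-format endpoint (only served when the flag above is true):
curl http://localhost:4040/metrics/executors/prometheus
```

Both endpoints expose the per-executor peak values listed in the table above; the second is intended for scraping by a Prometheus server.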
[spark] branch branch-3.0 updated: [SPARK-29574][K8S][FOLLOWUP] Fix bash comparison error in Docker entrypoint.sh
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 5a96ee7 [SPARK-29574][K8S][FOLLOWUP] Fix bash comparison error in Docker entrypoint.sh 5a96ee7 is described below commit 5a96ee7619ea07edefd030c66641e6e473a890e0 Author: Đặng Minh Dũng AuthorDate: Mon Mar 30 15:41:57 2020 -0700 [SPARK-29574][K8S][FOLLOWUP] Fix bash comparison error in Docker entrypoint.sh A small change to fix an error in Docker `entrypoint.sh` When spark running on Kubernetes, I got the following logs: ```log + '[' -n ']' + '[' -z ']' ++ /bin/hadoop classpath /opt/entrypoint.sh: line 62: /bin/hadoop: No such file or directory + export SPARK_DIST_CLASSPATH= + SPARK_DIST_CLASSPATH= ``` This is because you are missing some quotes on bash comparisons. No CI Closes #28075 from dungdm93/patch-1. Authored-by: Đặng Minh Dũng Signed-off-by: Dongjoon Hyun (cherry picked from commit 1d0fc9aa85b3ad3326b878de49b748413dee1dd9) Signed-off-by: Dongjoon Hyun --- .../kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh| 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh index 6ee3523..8218c29 100755 --- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh +++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh @@ -58,8 +58,8 @@ fi # If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor. # It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s. 
-if [ -n ${HADOOP_HOME} ] && [ -z ${SPARK_DIST_CLASSPATH} ]; then - export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath) +if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then + export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)" fi if ! [ -z ${HADOOP_CONF_DIR+x} ]; then - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
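The one-line fix above hinges on a subtle `test`/`[` rule: with the variable unset, the unquoted `[ -n ${HADOOP_HOME} ]` collapses to `[ -n ]`, a one-argument form that merely checks that the literal string `-n` is non-empty — so it is always true. A self-contained sketch of the before/after behavior:

```shell
# HADOOP_HOME is deliberately unset to reproduce the container case
# from the log quoted above.
unset HADOOP_HOME

# Buggy form: expands to `[ -n ]`, which tests the string "-n" itself
# and therefore always succeeds.
if [ -n ${HADOOP_HOME} ]; then unquoted=true; else unquoted=false; fi

# Fixed form: expands to `[ -n "" ]`, which is correctly false.
if [ -n "${HADOOP_HOME}" ]; then quoted=true; else quoted=false; fi

echo "unquoted fires: $unquoted"   # true  -- the bug
echo "quoted fires: $quoted"       # false -- the fix
```

The same reasoning applies to the `[ -z ${SPARK_DIST_CLASSPATH} ]` half of the condition: unquoted, it also collapses to a one-argument test that always succeeds, which is why both guards misfired and `/bin/hadoop` was invoked before the patch.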
[spark] branch branch-3.0 updated: [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new bd2b6aa [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper bd2b6aa is described below commit bd2b6aa42c8a5472c464bb1ee1a8f59a97f699f9 Author: Jungtaek Lim (HeartSaVioR) AuthorDate: Tue Mar 31 16:17:26 2020 + [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper ### What changes were proposed in this pull request? This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now. It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance. This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests. Credit to cloud-fan as he discovered the problem and proposed the solution. ### Why are the changes needed? Above section describes why it's a bug and how it's fixed. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UTs added. Closes #28079 from HeartSaVioR/SPARK-31312. 
Authored-by: Jungtaek Lim (HeartSaVioR) Signed-off-by: Wenchen Fan (cherry picked from commit 2a6aa8e87bec39f6bfec67e151ef8566b75caecd) Signed-off-by: Wenchen Fan --- .../scala/org/apache/spark/sql/hive/HiveShim.scala | 18 +- .../src/test/noclasspath/TestUDTF-spark-26560.jar | Bin 7462 -> 0 bytes sql/hive/src/test/noclasspath/hive-test-udfs.jar | Bin 0 -> 35660 bytes .../spark/sql/hive/HiveUDFDynamicLoadSuite.scala | 190 + .../spark/sql/hive/execution/SQLQuerySuite.scala | 47 - 5 files changed, 203 insertions(+), 52 deletions(-) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala index 3beef6b..04a6a8f 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala @@ -118,9 +118,12 @@ private[hive] object HiveShim { * * @param functionClassName UDF class name * @param instance optional UDF instance which contains additional information (for macro) + * @param clazz optional class instance to create UDF instance */ - private[hive] case class HiveFunctionWrapper(var functionClassName: String, -private var instance: AnyRef = null) extends java.io.Externalizable { + private[hive] case class HiveFunctionWrapper( + var functionClassName: String, + private var instance: AnyRef = null, + private var clazz: Class[_ <: AnyRef] = null) extends java.io.Externalizable { // for Serialization def this() = this(null) @@ -232,8 +235,10 @@ private[hive] object HiveShim { in.readFully(functionInBytes) // deserialize the function object via Hive Utilities +clazz = Utils.getContextOrSparkClassLoader.loadClass(functionClassName) + .asInstanceOf[Class[_ <: AnyRef]] instance = deserializePlan[AnyRef](new java.io.ByteArrayInputStream(functionInBytes), - Utils.getContextOrSparkClassLoader.loadClass(functionClassName)) + clazz) } } @@ -241,8 +246,11 @@ private[hive] object HiveShim { if (instance != null) { 
instance.asInstanceOf[UDFType] } else { -val func = Utils.getContextOrSparkClassLoader - .loadClass(functionClassName).getConstructor().newInstance().asInstanceOf[UDFType] +if (clazz == null) { + clazz = Utils.getContextOrSparkClassLoader.loadClass(functionClassName) +.asInstanceOf[Class[_ <: AnyRef]] +} +val func = clazz.getConstructor().newInstance().asInstanceOf[UDFType] if (!func.isInstanceOf[UDF]) { // We cache the function if it's no the Simple UDF, // as we always have to create new instance for Simple UDF diff --git a/sql/hive/src/test/noclasspath/TestUDTF-spark-26560.jar
[spark] branch master updated: [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2a6aa8e [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper 2a6aa8e is described below commit 2a6aa8e87bec39f6bfec67e151ef8566b75caecd Author: Jungtaek Lim (HeartSaVioR) AuthorDate: Tue Mar 31 16:17:26 2020 + [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper ### What changes were proposed in this pull request? This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now. It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance. This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests. Credit to cloud-fan as he discovered the problem and proposed the solution. ### Why are the changes needed? Above section describes why it's a bug and how it's fixed. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UTs added. Closes #28079 from HeartSaVioR/SPARK-31312. 
Authored-by: Jungtaek Lim (HeartSaVioR) Signed-off-by: Wenchen Fan --- .../scala/org/apache/spark/sql/hive/HiveShim.scala | 18 +- .../src/test/noclasspath/TestUDTF-spark-26560.jar | Bin 7462 -> 0 bytes sql/hive/src/test/noclasspath/hive-test-udfs.jar | Bin 0 -> 35660 bytes .../spark/sql/hive/HiveUDFDynamicLoadSuite.scala | 190 + .../spark/sql/hive/execution/SQLQuerySuite.scala | 47 - 5 files changed, 203 insertions(+), 52 deletions(-) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala index 3beef6b..04a6a8f 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala @@ -118,9 +118,12 @@ private[hive] object HiveShim { * * @param functionClassName UDF class name * @param instance optional UDF instance which contains additional information (for macro) + * @param clazz optional class instance to create UDF instance */ - private[hive] case class HiveFunctionWrapper(var functionClassName: String, -private var instance: AnyRef = null) extends java.io.Externalizable { + private[hive] case class HiveFunctionWrapper( + var functionClassName: String, + private var instance: AnyRef = null, + private var clazz: Class[_ <: AnyRef] = null) extends java.io.Externalizable { // for Serialization def this() = this(null) @@ -232,8 +235,10 @@ private[hive] object HiveShim { in.readFully(functionInBytes) // deserialize the function object via Hive Utilities +clazz = Utils.getContextOrSparkClassLoader.loadClass(functionClassName) + .asInstanceOf[Class[_ <: AnyRef]] instance = deserializePlan[AnyRef](new java.io.ByteArrayInputStream(functionInBytes), - Utils.getContextOrSparkClassLoader.loadClass(functionClassName)) + clazz) } } @@ -241,8 +246,11 @@ private[hive] object HiveShim { if (instance != null) { instance.asInstanceOf[UDFType] } else { -val func = Utils.getContextOrSparkClassLoader - 
.loadClass(functionClassName).getConstructor().newInstance().asInstanceOf[UDFType] +if (clazz == null) { + clazz = Utils.getContextOrSparkClassLoader.loadClass(functionClassName) +.asInstanceOf[Class[_ <: AnyRef]] +} +val func = clazz.getConstructor().newInstance().asInstanceOf[UDFType] if (!func.isInstanceOf[UDF]) { // We cache the function if it's no the Simple UDF, // as we always have to create new instance for Simple UDF diff --git a/sql/hive/src/test/noclasspath/TestUDTF-spark-26560.jar b/sql/hive/src/test/noclasspath/TestUDTF-spark-26560.jar deleted file mode 100644 index b73b17d..000 Binary files
[spark] branch branch-3.0 updated: [SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 20bb334 [SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2) 20bb334 is described below commit 20bb33453f85aeb5d2448252a9dd23d3ab85d251 Author: Wenchen Fan AuthorDate: Tue Mar 31 23:19:46 2020 +0800 [SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2) ### What changes were proposed in this pull request? Create statement plans in `DataFrameWriter(V2)`, like the SQL API. ### Why are the changes needed? It's better to leave all the resolution work to the analyzer. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27992 from cloud-fan/statement. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan (cherry picked from commit 8b01473e8bffe349b1ed993b61420d7d68896cd8) Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/ResolveCatalogs.scala| 8 ++-- .../spark/sql/catalyst/parser/AstBuilder.scala | 4 +- .../sql/catalyst/plans/logical/statements.scala| 2 + .../apache/spark/sql/connector/InMemoryTable.scala | 1 + .../org/apache/spark/sql/DataFrameWriter.scala | 55 -- .../org/apache/spark/sql/DataFrameWriterV2.scala | 43 - .../catalyst/analysis/ResolveSessionCatalog.scala | 8 ++-- .../execution/command/PlanResolutionSuite.scala| 4 +- 8 files changed, 66 insertions(+), 59 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala index 895dfbb..403e4e8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala @@ -134,7 +134,7 @@ class ResolveCatalogs(val catalogManager: 
CatalogManager) ignoreIfExists = c.ifNotExists) case c @ CreateTableAsSelectStatement( - NonSessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _) => + NonSessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _, _) => CreateTableAsSelect( catalog.asTableCatalog, tbl.asIdentifier, @@ -142,7 +142,7 @@ class ResolveCatalogs(val catalogManager: CatalogManager) c.partitioning ++ c.bucketSpec.map(_.asTransform), c.asSelect, convertTableProperties(c.properties, c.options, c.location, c.comment, c.provider), -writeOptions = c.options, +writeOptions = c.writeOptions, ignoreIfExists = c.ifNotExists) case RefreshTableStatement(NonSessionCatalogAndTable(catalog, tbl)) => @@ -161,7 +161,7 @@ class ResolveCatalogs(val catalogManager: CatalogManager) orCreate = c.orCreate) case c @ ReplaceTableAsSelectStatement( - NonSessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _) => + NonSessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _, _) => ReplaceTableAsSelect( catalog.asTableCatalog, tbl.asIdentifier, @@ -169,7 +169,7 @@ class ResolveCatalogs(val catalogManager: CatalogManager) c.partitioning ++ c.bucketSpec.map(_.asTransform), c.asSelect, convertTableProperties(c.properties, c.options, c.location, c.comment, c.provider), -writeOptions = c.options, +writeOptions = c.writeOptions, orCreate = c.orCreate) case DropTableStatement(NonSessionCatalogAndTable(catalog, tbl), ifExists, _) => diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 09d316b6..cd4c895 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -2779,7 +2779,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging case Some(query) => CreateTableAsSelectStatement( table, query, partitioning, 
bucketSpec, properties, provider, options, location, comment, - ifNotExists = ifNotExists) + writeOptions = Map.empty, ifNotExists = ifNotExists) case None if temp => // CREATE TEMPORARY TABLE ... USING ... is not supported by the catalyst parser. @@ -2834,7 +2834,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging case Some(query) => ReplaceTableAsSelectStatement(table, query, partitioning, bucketSpec, properties, - provider,
[spark] branch master updated: [SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8b01473 [SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2) 8b01473 is described below commit 8b01473e8bffe349b1ed993b61420d7d68896cd8 Author: Wenchen Fan AuthorDate: Tue Mar 31 23:19:46 2020 +0800 [SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2) ### What changes were proposed in this pull request? Create statement plans in `DataFrameWriter(V2)`, like the SQL API. ### Why are the changes needed? It's better to leave all the resolution work to the analyzer. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27992 from cloud-fan/statement. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/ResolveCatalogs.scala| 8 ++-- .../spark/sql/catalyst/parser/AstBuilder.scala | 4 +- .../sql/catalyst/plans/logical/statements.scala| 2 + .../apache/spark/sql/connector/InMemoryTable.scala | 1 + .../org/apache/spark/sql/DataFrameWriter.scala | 55 -- .../org/apache/spark/sql/DataFrameWriterV2.scala | 43 - .../catalyst/analysis/ResolveSessionCatalog.scala | 8 ++-- .../execution/command/PlanResolutionSuite.scala| 4 +- 8 files changed, 66 insertions(+), 59 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala index 463793e..2a0a944 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala @@ -156,7 +156,7 @@ class ResolveCatalogs(val catalogManager: CatalogManager) ignoreIfExists = c.ifNotExists) case c @ CreateTableAsSelectStatement( - 
NonSessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _) => + NonSessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _, _) => CreateTableAsSelect( catalog.asTableCatalog, tbl.asIdentifier, @@ -164,7 +164,7 @@ class ResolveCatalogs(val catalogManager: CatalogManager) c.partitioning ++ c.bucketSpec.map(_.asTransform), c.asSelect, convertTableProperties(c.properties, c.options, c.location, c.comment, c.provider), -writeOptions = c.options, +writeOptions = c.writeOptions, ignoreIfExists = c.ifNotExists) case RefreshTableStatement(NonSessionCatalogAndTable(catalog, tbl)) => @@ -183,7 +183,7 @@ class ResolveCatalogs(val catalogManager: CatalogManager) orCreate = c.orCreate) case c @ ReplaceTableAsSelectStatement( - NonSessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _) => + NonSessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _, _) => ReplaceTableAsSelect( catalog.asTableCatalog, tbl.asIdentifier, @@ -191,7 +191,7 @@ class ResolveCatalogs(val catalogManager: CatalogManager) c.partitioning ++ c.bucketSpec.map(_.asTransform), c.asSelect, convertTableProperties(c.properties, c.options, c.location, c.comment, c.provider), -writeOptions = c.options, +writeOptions = c.writeOptions, orCreate = c.orCreate) case DropTableStatement(NonSessionCatalogAndTable(catalog, tbl), ifExists, _) => diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index 0f0ee80..cc41863 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -2779,7 +2779,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging case Some(query) => CreateTableAsSelectStatement( table, query, partitioning, bucketSpec, properties, provider, options, location, comment, - ifNotExists = ifNotExists) + 
writeOptions = Map.empty, ifNotExists = ifNotExists) case None if temp => // CREATE TEMPORARY TABLE ... USING ... is not supported by the catalyst parser. @@ -2834,7 +2834,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging case Some(query) => ReplaceTableAsSelectStatement(table, query, partitioning, bucketSpec, properties, - provider, options, location, comment, orCreate = orCreate) + provider, options, location, comment, writeOptions =
svn commit: r38759 - in /dev/spark/v3.0.0-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _site/api/java/org/apache/parqu
Author: rxin Date: Tue Mar 31 13:45:27 2020 New Revision: 38759 Log: Apache Spark v3.0.0-rc1 docs [This commit notification would consist of 1911 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31314][CORE] Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 08bb5f0 [SPARK-31314][CORE] Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly 08bb5f0 is described below commit 08bb5f0ffeb4f5e37417f15931717784db544730 Author: Yuanjian Li AuthorDate: Tue Mar 31 19:01:08 2020 +0800 [SPARK-31314][CORE] Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly

### What changes were proposed in this pull request?

This reverts commit 8cf76f8d61b393bb3abd9780421b978e98db8cae. #25962

### Why are the changes needed?

In SPARK-29285, we changed to create shuffle temporary files eagerly. This helps avoid failing the entire task in the scenario of an occasional disk failure. But for applications in which many tasks don't actually create shuffle files, it caused overhead. See the benchmark below.

Env: Spark local-cluster[2, 4, 19968]; each query was run for 5 rounds, each round 5 times.
Data: TPC-DS scale=99 generated by spark-tpcds-datagen

Results:

| | Base | Revert |
|-|-|-|
| Q20 | Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) Median 2.722007606 | Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, 2.224627274) Median 2.586498463 |
| Q33 | Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) Median 4.568787136 | Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, 3.783188024) Median 4.082311276 |
| Q52 | Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) Median 3.225437871 | Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, 2.606163423) Median 3.196025108 |
| Q56 | Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) Median 4.609965579 | Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, 3.657525982) Median 4.195202502 |

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #28072 from xuanyuanking/SPARK-29285-revert.
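For reference, each benchmark cell's median is simply the middle of its five recorded run times; the Q20 row can be rechecked with Python's standard library (values transcribed from the table, not recomputed from Spark):

```python
from statistics import median

# Raw run times (seconds) for Q20, copied from the Results table.
q20_base = [4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579]
q20_revert = [3.763185446, 2.586498463, 2.593472842, 2.320522846, 2.224627274]

assert median(q20_base) == 2.722007606
assert median(q20_revert) == 2.586498463
```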
Authored-by: Yuanjian Li Signed-off-by: Wenchen Fan (cherry picked from commit 07c50784d34e10bbfafac7498c0b70c4ec08048a) Signed-off-by: Wenchen Fan --- .../apache/spark/storage/DiskBlockManager.scala| 36 -- .../main/scala/org/apache/spark/util/Utils.scala | 2 +- .../spark/storage/DiskBlockManagerSuite.scala | 43 +- 3 files changed, 10 insertions(+), 71 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala b/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala index ee43b76..f211394 100644 --- a/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala +++ b/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala @@ -20,8 +20,6 @@ package org.apache.spark.storage import java.io.{File, IOException} import java.util.UUID -import scala.util.control.NonFatal - import org.apache.spark.SparkConf import org.apache.spark.executor.ExecutorExitCode import org.apache.spark.internal.{config, Logging} @@ -119,38 +117,20 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea /** Produces a unique block id and File suitable for storing local intermediate results. */ def createTempLocalBlock(): (TempLocalBlockId, File) = { -var blockId = TempLocalBlockId(UUID.randomUUID()) -var tempLocalFile = getFile(blockId) -var count = 0 -while (!canCreateFile(tempLocalFile) && count < Utils.MAX_DIR_CREATION_ATTEMPTS) { - blockId = TempLocalBlockId(UUID.randomUUID()) - tempLocalFile = getFile(blockId) - count += 1 +var blockId = new TempLocalBlockId(UUID.randomUUID()) +while (getFile(blockId).exists()) { + blockId = new TempLocalBlockId(UUID.randomUUID()) } -(blockId, tempLocalFile) +(blockId, getFile(blockId)) } /** Produces a unique block id and File suitable for storing shuffled intermediate results. 
*/ def createTempShuffleBlock(): (TempShuffleBlockId, File) = { -var blockId = TempShuffleBlockId(UUID.randomUUID()) -var tempShuffleFile = getFile(blockId) -var count = 0 -while (!canCreateFile(tempShuffleFile) && count < Utils.MAX_DIR_CREATION_ATTEMPTS) { - blockId = TempShuffleBlockId(UUID.randomUUID()) - tempShuffleFile = getFile(blockId) -
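The reverted `createTempShuffleBlock` retries random UUID-based block ids until one maps to a nonexistent file, and defers creating the file to the writer. A minimal Python sketch of that lazy naming scheme (hypothetical helper name, not Spark's API):

```python
import os
import tempfile
import uuid

def create_temp_shuffle_name(dir_path):
    """Pick a UUID-based temp shuffle file name that no existing file uses.
    Mirroring the reverted (lazy) behaviour, the file itself is NOT created
    here; it only appears when the task actually writes shuffle output."""
    name = "temp_shuffle_%s" % uuid.uuid4()
    while os.path.exists(os.path.join(dir_path, name)):
        name = "temp_shuffle_%s" % uuid.uuid4()
    return name

demo_dir = tempfile.mkdtemp()
block_name = create_temp_shuffle_name(demo_dir)
```

Because no file is touched until a task actually writes shuffle data, tasks that produce no shuffle files pay no filesystem cost, which is the overhead the benchmark above measured.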
[spark] branch master updated (bb0b416 -> 07c5078)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from bb0b416 [SPARK-31297][SQL] Speed up dates rebasing add 07c5078 [SPARK-31314][CORE] Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly No new revisions were added by this update. Summary of changes: .../apache/spark/storage/DiskBlockManager.scala| 36 -- .../main/scala/org/apache/spark/util/Utils.scala | 2 +- .../spark/storage/DiskBlockManagerSuite.scala | 43 +- 3 files changed, 10 insertions(+), 71 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-31297][SQL] Speed up dates rebasing
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new e7885b8 [SPARK-31297][SQL] Speed up dates rebasing e7885b8 is described below commit e7885b8a6686bc9179f741f1394dbbf7a9e211ef Author: Maxim Gekk AuthorDate: Tue Mar 31 17:38:47 2020 +0800 [SPARK-31297][SQL] Speed up dates rebasing

### What changes were proposed in this pull request?

In the PR, I propose to replace the current implementation of the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions in `DateTimeUtils` by a new one which is based on the fact that the difference between the Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars changed only 14 times over the entire supported range of valid dates `[0001-01-01, 9999-12-31]`:

| date | Proleptic Greg. days | Hybrid (Julian+Greg) days | diff |
|------|----------------------|---------------------------|------|
|0001-01-01|-719162|-719164|-2|
|0100-03-01|-682944|-682945|-1|
|0200-03-01|-646420|-646420|0|
|0300-03-01|-609896|-609895|1|
|0500-03-01|-536847|-536845|2|
|0600-03-01|-500323|-500320|3|
|0700-03-01|-463799|-463795|4|
|0900-03-01|-390750|-390745|5|
|1000-03-01|-354226|-354220|6|
|1100-03-01|-317702|-317695|7|
|1300-03-01|-244653|-244645|8|
|1400-03-01|-208129|-208120|9|
|1500-03-01|-171605|-171595|10|
|1582-10-15|-141427|-141427|0|

For the given days since the epoch, the proposed implementation finds the range of days which the input days belong to, and adds the diff in days between calendars to the input. The result is the rebased days since the epoch in the target calendar. For example, suppose we need to rebase -650000 days from the Proleptic Gregorian calendar to the hybrid calendar. In that case, the input falls into the bucket [-682944, -646420), and the diff associated with that range is -1. To get the rebased days in the Julian calendar, we add -1 to -650000, and the result is -650001.
### Why are the changes needed?

To make dates rebasing faster.

### Does this PR introduce any user-facing change?

No, the results should be the same for the valid range of the `DATE` type `[0001-01-01, 9999-12-31]`.

### How was this patch tested?

- Added 2 tests to `DateTimeUtilsSuite` for the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions. The tests check that the results of the old and new (optimized) implementations are the same for all supported dates.
- Re-ran `DateTimeRebaseBenchmark` on:

| Item | Description |
|------|-------------|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28067 from MaxGekk/optimize-rebasing. Lead-authored-by: Maxim Gekk Co-authored-by: Max Gekk Signed-off-by: Wenchen Fan (cherry picked from commit bb0b416f0b3a2747a420b17d1bf659891bae3274) Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/util/DateTimeUtils.scala| 79 +++--- .../sql/catalyst/util/DateTimeUtilsSuite.scala | 58 +++- .../DateTimeRebaseBenchmark-jdk11-results.txt | 64 +- .../benchmarks/DateTimeRebaseBenchmark-results.txt | 64 +- 4 files changed, 174 insertions(+), 91 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala index 2b646cc..44cabe2 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala @@ -1040,6 +1040,44 @@ object DateTimeUtils { } /** + * Rebases days since the epoch from an original to a target calendar, for instance + * from a hybrid (Julian + Gregorian) to the Proleptic Gregorian calendar. + * + * It finds the latest switch day which is less than `days`, and adds the difference + * in days associated with the switch day to the given `days`.
The function is based + * on a linear search which starts from the most recent switch days. This allows performing + * fewer comparisons for modern dates. + * + * @param switchDays The days when the difference in days between the original and target + * calendars was changed. + * @param diffs The differences in days between calendars. + * @param days The number of days since the epoch 1970-01-01 to be rebased to the + * target calendar. + * @return The rebased day + */ + private def rebaseDays(switchDays: Array[Int], diffs: Array[Int], days: Int): Int = { +var i = switchDays.length - 1 +
svn commit: r38754 - /dev/spark/v3.0.0-rc1-bin/
Author: rxin Date: Tue Mar 31 09:57:10 2020 New Revision: 38754 Log: Apache Spark v3.0.0-rc1 Added: dev/spark/v3.0.0-rc1-bin/ dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz (with props) dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz.asc dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz.sha512 dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz (with props) dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz.asc dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz.sha512 dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz (with props) dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz.asc dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz.sha512 dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop2.7.tgz (with props) dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop2.7.tgz.asc dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop2.7.tgz.sha512 dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop3.2.tgz (with props) dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop3.2.tgz.asc dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-hadoop3.2.tgz.sha512 dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-without-hadoop.tgz (with props) dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-without-hadoop.tgz.asc dev/spark/v3.0.0-rc1-bin/spark-3.0.0-bin-without-hadoop.tgz.sha512 dev/spark/v3.0.0-rc1-bin/spark-3.0.0.tgz (with props) dev/spark/v3.0.0-rc1-bin/spark-3.0.0.tgz.asc dev/spark/v3.0.0-rc1-bin/spark-3.0.0.tgz.sha512 Added: dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz.asc == --- dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz.asc (added) +++ dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz.asc Tue Mar 31 09:57:10 2020 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJEBAABCgAuFiEESovaSObiEqc0YyUC3qlj4uk0fWYFAl6C/0sQHHJ4aW5AYXBh +Y2hlLm9yZwAKCRDeqWPi6TR9ZtCiD/9GtNXfxGR9oh2B4k+fg38uCrloGUYo3Dx9 +eJU6G55fbKtXK24dKlxZQCVDpwLihycnLULcV+/D75vWa4tSoG6n/FTHimCnUJWQ +UkEsxqhWuGi25rUx4VsOQeHPYIP9/2pVGVyanFzRp+yAyldATGG36u3Xv5lqox6b +6pARVwC6FZWKuk1b47xbRfYKUoNTkObhGjcKKyigexqx/nZOp99NP+sVlEqRD/l/ +B7l3kgAVq3XlZKUCkMhWgAHT6rPNkvwBdYZFce9gJHuG75Zw5rQ2hHesEqDOVlC1 +kqJPtpmb2U93ItBF6ArlmXcm+60rLa++B8cyrEsKLIyYxRpHH1bQmLB9TTzDeFpz +e+WWlUiDpC1Lorzvg+44MeOXSj9EhNgqsYypGKhlh6WTN8A+BRzvJRMpDMLElRz6 +lHaceqn9NC4eE5tzcyXAFL+8Y644nCTIZQuND72LvIv7rO0YXq/6yeudM+SDeANU +vscR4LiQ7/a3oSpxoIuA0MjKz6gWUaYFgsb8OuUC4VQPJKQZG+57SOazq1VTlB6/ +Ur8pePIUxU52EmzmIp08ws8v+NOo9pMxw7lyBwpmGX0/ax6p9v1xVcCeXqH4HYvA +9d7a7hZy9yoguAGsVkibSym8e6XITCDoXLb9/HPEhfdyxFgi87DVjKZ84HkyFw9/ +OzHhumSp/Q== +=zl/N +-END PGP SIGNATURE- Added: dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz.sha512 == --- dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz.sha512 (added) +++ dev/spark/v3.0.0-rc1-bin/SparkR_3.0.0.tar.gz.sha512 Tue Mar 31 09:57:10 2020 @@ -0,0 +1,3 @@ +SparkR_3.0.0.tar.gz: C2D9C0A5 E71C5B56 48AC15AA 998ABD06 2FDB4D5C D2B7C344 + B1949A7B 28508364 A9A45767 F2642F17 7EBFF4B0 55823EBD + BE76A2CE 5604660F 62D1654D 8271287B Added: dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz.asc == --- dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz.asc (added) +++ dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz.asc Tue Mar 31 09:57:10 2020 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJEBAABCgAuFiEESovaSObiEqc0YyUC3qlj4uk0fWYFAl6C/0wQHHJ4aW5AYXBh +Y2hlLm9yZwAKCRDeqWPi6TR9ZkfTD/4zQ5FuCr+giluZHaBnaZy7PAtSkoTjAWKX +8zObXESsoTlIIjHEpBUmUU6O0tZODFOF7Zau9HkftroGurYxpTWE5nX0e//71JuC +smBWLCgAeOlNEdeZUd2zm7pPWJfwRpsOcEfexb+RvaFQriw559Erxb5NoWHFIkg/ +tsjtjitMqLxcMlzZW7A/89zqmrnzBu1vhh/q8STzA0Ub6Jq+JzD4e6yatYAzjRj3 ++Um7+NL+g/2tmweH8f9TtYzQFcowm6DdXi53fWZX55oVc1xBRTNuSnAdCJlkgEPg +nUxEcuXUvHn/NbNNHPBwP6xMKyKqJu8+4vNLzr2ZxaxArPYF2FqTl8sFNxwVBM1Y +PnKun7iZiLq5JqC2OopiDa8FJP0JQkYVyBWAx3BOscsAELfdlZHlPdekcLE6YHHV +pde79YJ0tzUFIdH/Ulw4Jag4Ixunrg+ajmLS8n9ncpX0I81Zv8IJDaBf0cBboFw8 +kTqAvNkcsoGdRn1OiQnlE2IUib/R0fk7MktOyoZpfKzbCzxBZgLTO4FKTbRCydQX +I8UhuRhELHCI7YXJHwbk0Swp6+h36dUQtLxFfD/OZdDQABOK+nEVjNsBIHb7ULDB +pCckj8HBHwaynvNLogS1KJHThW8LEXAmVQFCD39XTNMnhfCUePyzlAC4RPByIFR4
[spark] branch master updated: [SPARK-31297][SQL] Speed up dates rebasing
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bb0b416 [SPARK-31297][SQL] Speed up dates rebasing bb0b416 is described below commit bb0b416f0b3a2747a420b17d1bf659891bae3274 Author: Maxim Gekk AuthorDate: Tue Mar 31 17:38:47 2020 +0800 [SPARK-31297][SQL] Speed up dates rebasing

### What changes were proposed in this pull request?

In the PR, I propose to replace the current implementation of the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions in `DateTimeUtils` by a new one which is based on the fact that the difference between the Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars changed only 14 times over the entire supported range of valid dates `[0001-01-01, 9999-12-31]`:

| date | Proleptic Greg. days | Hybrid (Julian+Greg) days | diff |
|------|----------------------|---------------------------|------|
|0001-01-01|-719162|-719164|-2|
|0100-03-01|-682944|-682945|-1|
|0200-03-01|-646420|-646420|0|
|0300-03-01|-609896|-609895|1|
|0500-03-01|-536847|-536845|2|
|0600-03-01|-500323|-500320|3|
|0700-03-01|-463799|-463795|4|
|0900-03-01|-390750|-390745|5|
|1000-03-01|-354226|-354220|6|
|1100-03-01|-317702|-317695|7|
|1300-03-01|-244653|-244645|8|
|1400-03-01|-208129|-208120|9|
|1500-03-01|-171605|-171595|10|
|1582-10-15|-141427|-141427|0|

For the given days since the epoch, the proposed implementation finds the range of days which the input days belong to, and adds the diff in days between calendars to the input. The result is the rebased days since the epoch in the target calendar. For example, suppose we need to rebase -650000 days from the Proleptic Gregorian calendar to the hybrid calendar. In that case, the input falls into the bucket [-682944, -646420), and the diff associated with that range is -1. To get the rebased days in the Julian calendar, we add -1 to -650000, and the result is -650001.

### Why are the changes needed?
To make dates rebasing faster.

### Does this PR introduce any user-facing change?

No, the results should be the same for the valid range of the `DATE` type `[0001-01-01, 9999-12-31]`.

### How was this patch tested?

- Added 2 tests to `DateTimeUtilsSuite` for the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions. The tests check that the results of the old and new (optimized) implementations are the same for all supported dates.
- Re-ran `DateTimeRebaseBenchmark` on:

| Item | Description |
|------|-------------|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28067 from MaxGekk/optimize-rebasing. Lead-authored-by: Maxim Gekk Co-authored-by: Max Gekk Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/util/DateTimeUtils.scala| 79 +++--- .../sql/catalyst/util/DateTimeUtilsSuite.scala | 58 +++- .../DateTimeRebaseBenchmark-jdk11-results.txt | 64 +- .../benchmarks/DateTimeRebaseBenchmark-results.txt | 64 +- 4 files changed, 174 insertions(+), 91 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala index 268cd19..04994a1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala @@ -1034,6 +1034,44 @@ object DateTimeUtils { } /** + * Rebases days since the epoch from an original to a target calendar, for instance + * from a hybrid (Julian + Gregorian) to the Proleptic Gregorian calendar. + * + * It finds the latest switch day which is less than `days`, and adds the difference + * in days associated with the switch day to the given `days`. The function is based + * on a linear search which starts from the most recent switch days.
This allows performing + * fewer comparisons for modern dates. + * + * @param switchDays The days when the difference in days between the original and target + * calendars was changed. + * @param diffs The differences in days between calendars. + * @param days The number of days since the epoch 1970-01-01 to be rebased to the + * target calendar. + * @return The rebased day + */ + private def rebaseDays(switchDays: Array[Int], diffs: Array[Int], days: Int): Int = { +var i = switchDays.length - 1 +while (i >= 0 && days < switchDays(i)) { + i -= 1 +} +val rebased = days + diffs(if (i < 0) 0 else i)
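The `rebaseDays` body shown (truncated) in the diff is a backwards linear search over the switch days. A Python sketch of the same logic, seeded with the Gregorian-to-hybrid switch days and diffs transcribed from the table in the commit message (illustrative values; Spark's actual internal arrays and names may differ):

```python
# Switch days (Proleptic Gregorian days since 1970-01-01) and the matching
# differences (hybrid days minus Gregorian days), taken from the table above.
SWITCH_DAYS = [-719162, -682944, -646420, -609896, -536847, -500323, -463799,
               -390750, -354226, -317702, -244653, -208129, -171605, -141427]
DIFFS = [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0]

def rebase_days(switch_days, diffs, days):
    # Linear search backwards from the most recent switch day, so modern
    # dates (on or after 1582-10-15) need only a single comparison.
    i = len(switch_days) - 1
    while i >= 0 and days < switch_days[i]:
        i -= 1
    return days + diffs[0 if i < 0 else i]
```

For example, `rebase_days(SWITCH_DAYS, DIFFS, -650000)` lands in the bucket starting at -682944 and returns -650001, matching the worked example in the commit message; any day from 1582-10-15 onwards picks up a diff of 0 and is returned unchanged.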
svn commit: r38753 - /dev/spark/v3.0.0-rc1-bin/
Author: rxin Date: Tue Mar 31 07:25:15 2020 New Revision: 38753 Log: retry Removed: dev/spark/v3.0.0-rc1-bin/
[spark] branch master updated: Revert "[SPARK-30879][DOCS] Refine workflow for building docs"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4d4c3e7 Revert "[SPARK-30879][DOCS] Refine workflow for building docs" 4d4c3e7 is described below commit 4d4c3e76f6d1d5ede511c3ff4036b0c458a0a4e3 Author: HyukjinKwon AuthorDate: Tue Mar 31 16:11:59 2020 +0900 Revert "[SPARK-30879][DOCS] Refine workflow for building docs" This reverts commit 7892f88f84acc8c061aaa3d2987f2c8b71e41963. --- .gitignore | 2 -- dev/create-release/do-release-docker.sh | 2 +- dev/create-release/spark-rm/Dockerfile | 61 - docs/README.md | 44 4 files changed, 37 insertions(+), 72 deletions(-) diff --git a/.gitignore b/.gitignore index 60a12e3..198fdee 100644 --- a/.gitignore +++ b/.gitignore @@ -18,8 +18,6 @@ .idea_modules/ .project .pydevproject -.python-version -.ruby-version .scala_dependencies .settings /lib/ diff --git a/dev/create-release/do-release-docker.sh b/dev/create-release/do-release-docker.sh index cda21eb..694a87b 100755 --- a/dev/create-release/do-release-docker.sh +++ b/dev/create-release/do-release-docker.sh @@ -96,7 +96,7 @@ fcreate_secure "$GPG_KEY_FILE" $GPG --export-secret-key --armor "$GPG_KEY" > "$GPG_KEY_FILE" run_silent "Building spark-rm image with tag $IMGTAG..." "docker-build.log" \ - docker build --no-cache -t "spark-rm:$IMGTAG" --build-arg UID=$UID "$SELF/spark-rm" + docker build -t "spark-rm:$IMGTAG" --build-arg UID=$UID "$SELF/spark-rm" # Write the release information to a file with environment variables to be used when running the # image. 
diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile index d310aaf..6345168 100644 --- a/dev/create-release/spark-rm/Dockerfile +++ b/dev/create-release/spark-rm/Dockerfile @@ -20,9 +20,9 @@ # Includes: # * Java 8 # * Ivy -# * Python 3.7 -# * Ruby 2.7 +# * Python (2.7.15/3.6.7) # * R-base/R-base-dev (3.6.1) +# * Ruby 2.3 build utilities FROM ubuntu:18.04 @@ -33,11 +33,15 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true # These arguments are just for reuse and not really meant to be customized. ARG APT_INSTALL="apt-get install --no-install-recommends -y" -ARG PIP_PKGS="sphinx==2.3.1 mkdocs==1.0.4 numpy==1.18.1" -ARG GEM_PKGS="jekyll:4.0.0 jekyll-redirect-from:0.16.0 rouge:3.15.0" +ARG BASE_PIP_PKGS="setuptools wheel" +ARG PIP_PKGS="pyopenssl numpy sphinx" # Install extra needed repos and refresh. # - CRAN repo +# - Ruby repo (for doc generation) +# +# This is all in a single "RUN" command so that if anything changes, "apt update" is run to fetch +# the most current package versions (instead of potentially using old versions cached by docker). RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \ echo 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/' >> /etc/apt/sources.list && \ gpg --keyserver keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \ @@ -46,43 +50,36 @@ RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \ rm -rf /var/lib/apt/lists/* && \ apt-get clean && \ apt-get update && \ + $APT_INSTALL software-properties-common && \ + apt-add-repository -y ppa:brightbox/ruby-ng && \ + apt-get update && \ # Install openjdk 8. 
$APT_INSTALL openjdk-8-jdk && \ update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java && \ # Install build / source control tools $APT_INSTALL curl wget git maven ivy subversion make gcc lsof libffi-dev \ -pandoc pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev - -ENV PATH "$PATH:/root/.pyenv/bin:/root/.pyenv/shims" -RUN curl -L https://github.com/pyenv/pyenv-installer/raw/dd3f7d0914c5b4a416ca71ffabdf2954f2021596/bin/pyenv-installer | bash -RUN $APT_INSTALL libbz2-dev libreadline-dev libsqlite3-dev -RUN pyenv install 3.7.6 -RUN pyenv global 3.7.6 -RUN python --version -RUN pip install --upgrade pip -RUN pip --version -RUN pip install $PIP_PKGS - -ENV PATH "$PATH:/root/.rbenv/bin:/root/.rbenv/shims" -RUN curl -fsSL https://github.com/rbenv/rbenv-installer/raw/108c12307621a0aa06f19799641848dde1987deb/bin/rbenv-installer | bash -RUN rbenv install 2.7.0 -RUN rbenv global 2.7.0 -RUN ruby --version -RUN $APT_INSTALL g++ -RUN gem --version -RUN gem install --no-document $GEM_PKGS - -RUN \ +pandoc pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev && \ curl -sL https://deb.nodesource.com/setup_11.x | bash && \ - $APT_INSTALL nodejs - -# Install R packages and dependencies used when building. -# R depends on pandoc*, libssl (which are installed above). -RUN \ + $APT_INSTALL nodejs && \ + # Install needed python packages. Use pip for installing packages (for
[spark] branch branch-2.4 updated: [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new e226f68 [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound e226f68 is described below commit e226f687c172c63ce9ae6531772af9df124c9454 Author: Ben Ryves AuthorDate: Tue Mar 31 15:16:17 2020 +0900 [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound ### What changes were proposed in this pull request? A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`. ### Why are the changes needed? `rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](https://github.com/apache/spark/blob/a1dbcd13a3eeaee50cc1a46e909f9478d6d55177/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala#L71)). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0). ### Does this PR introduce any user-facing change? Only documentation changes. ### How was this patch tested? Documentation changes only. Closes #28071 from Smeb/master. Authored-by: Ben Ryves Signed-off-by: HyukjinKwon --- R/pkg/R/functions.R | 2 +- python/pyspark/sql/functions.py | 2 +- sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index e914dd3..09b0a21 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -2614,7 +2614,7 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"), #' @details #' \code{rand}: Generates a random column with independent and identically distributed (i.i.d.) 
-#' samples from U[0.0, 1.0]. +#' samples uniformly distributed in [0.0, 1.0). #' Note: the function is non-deterministic in general case. #' #' @rdname column_nonaggregate_functions diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index b964980..c305529 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -553,7 +553,7 @@ def nanvl(col1, col2): @since(1.4) def rand(seed=None): """Generates a random column with independent and identically distributed (i.i.d.) samples -from U[0.0, 1.0]. +uniformly distributed in [0.0, 1.0). .. note:: The function is non-deterministic in general case. diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index f419a38..21ad1fd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -1224,7 +1224,7 @@ object functions { /** * Generate a random column with independent and identically distributed (i.i.d.) samples - * from U[0.0, 1.0]. + * uniformly distributed in [0.0, 1.0). * * @note The function is non-deterministic in general case. * @@ -1235,7 +1235,7 @@ object functions { /** * Generate a random column with independent and identically distributed (i.i.d.) samples - * from U[0.0, 1.0]. + * uniformly distributed in [0.0, 1.0). * * @note The function is non-deterministic in general case. * - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
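The corrected half-open interval matters whenever the sample is scaled to another range. A small Python illustration of the documented `[0.0, 1.0)` contract, using the standard library's `random.random()`, which has the same bounds as Spark's `rand()` (this is a sketch, not Spark code):

```python
import random

def scaled_rand(lo, hi, rng=random.random):
    # rng() is drawn from the half-open interval [0.0, 1.0), so the scaled
    # result lies in [lo, hi): the upper bound is exclusive, matching the
    # corrected documentation for rand().
    return lo + (hi - lo) * rng()

samples = [scaled_rand(5.0, 10.0) for _ in range(10000)]
```

If 1.0 were a possible return value, `scaled_rand(5.0, 10.0)` could yield exactly 10.0; with the exclusive upper bound it never does.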
[spark] branch branch-2.4 updated: [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new e226f68  [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
e226f68 is described below

commit e226f687c172c63ce9ae6531772af9df124c9454
Author: Ben Ryves
AuthorDate: Tue Mar 31 15:16:17 2020 +0900

    [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound

    ### What changes were proposed in this pull request?
    A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`.

    ### Why are the changes needed?
    `rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](https://github.com/apache/spark/blob/a1dbcd13a3eeaee50cc1a46e909f9478d6d55177/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala#L71)). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0).

    ### Does this PR introduce any user-facing change?
    Only documentation changes.

    ### How was this patch tested?
    Documentation changes only.

    Closes #28071 from Smeb/master.

    Authored-by: Ben Ryves
    Signed-off-by: HyukjinKwon
---
 R/pkg/R/functions.R                                          | 2 +-
 python/pyspark/sql/functions.py                              | 2 +-
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index e914dd3..09b0a21 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -2614,7 +2614,7 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"),
 #' @details
 #' \code{rand}: Generates a random column with independent and identically distributed (i.i.d.)
-#' samples from U[0.0, 1.0].
+#' samples uniformly distributed in [0.0, 1.0).
 #' Note: the function is non-deterministic in general case.
 #'
 #' @rdname column_nonaggregate_functions
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index b964980..c305529 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -553,7 +553,7 @@ def nanvl(col1, col2):
 @since(1.4)
 def rand(seed=None):
     """Generates a random column with independent and identically distributed (i.i.d.) samples
-    from U[0.0, 1.0].
+    uniformly distributed in [0.0, 1.0).

     .. note:: The function is non-deterministic in general case.
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index f419a38..21ad1fd 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1224,7 +1224,7 @@ object functions {
   /**
    * Generate a random column with independent and identically distributed (i.i.d.) samples
-   * from U[0.0, 1.0].
+   * uniformly distributed in [0.0, 1.0).
    *
    * @note The function is non-deterministic in general case.
    *
@@ -1235,7 +1235,7 @@ object functions {
   /**
    * Generate a random column with independent and identically distributed (i.i.d.) samples
-   * from U[0.0, 1.0].
+   * uniformly distributed in [0.0, 1.0).
    *
    * @note The function is non-deterministic in general case.
    *

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
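The notational point this patch makes — closed `U[0.0, 1.0]` versus half-open `[0.0, 1.0)` — is that 0.0 is a possible return value of `rand()` but 1.0 is not. As a quick plain-Python illustration (using the standard library's `random.random()`, which follows the same half-open convention, rather than a live Spark session):

```python
import random

# random.random() draws i.i.d. samples uniformly from the half-open
# interval [0.0, 1.0): 0.0 can be returned, 1.0 can not. This is the
# same convention the patched rand() documentation describes.
samples = [random.random() for _ in range(100_000)]

# Every sample respects the exclusive upper bound.
assert all(0.0 <= x < 1.0 for x in samples)
```

This is why the docs now say "uniformly distributed in [0.0, 1.0)" instead of "from U[0.0, 1.0]": in the `X ~ U(a, b)` notation the commit message quotes, square brackets mean the endpoint is attainable.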
[spark] branch master updated: [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new fa37856  [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
fa37856 is described below

commit fa378567105ec9d9bbe30edf4b74b09c3df27658
Author: Ben Ryves
AuthorDate: Tue Mar 31 15:16:17 2020 +0900

    [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound

    ### What changes were proposed in this pull request?
    A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`.

    ### Why are the changes needed?
    `rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](https://github.com/apache/spark/blob/a1dbcd13a3eeaee50cc1a46e909f9478d6d55177/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala#L71)). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0).

    ### Does this PR introduce any user-facing change?
    Only documentation changes.

    ### How was this patch tested?
    Documentation changes only.

    Closes #28071 from Smeb/master.

    Authored-by: Ben Ryves
    Signed-off-by: HyukjinKwon
---
 R/pkg/R/functions.R                                          | 2 +-
 python/pyspark/sql/functions.py                              | 2 +-
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 3d30ce1..2baf3aa 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -2975,7 +2975,7 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"),
 #' @details
 #' \code{rand}: Generates a random column with independent and identically distributed (i.i.d.)
-#' samples from U[0.0, 1.0].
+#' samples uniformly distributed in [0.0, 1.0).
 #' Note: the function is non-deterministic in general case.
 #'
 #' @rdname column_nonaggregate_functions
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 4b51dc1..de0d38e 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -652,7 +652,7 @@ def percentile_approx(col, percentage, accuracy=1):
 @since(1.4)
 def rand(seed=None):
     """Generates a random column with independent and identically distributed (i.i.d.) samples
-    from U[0.0, 1.0].
+    uniformly distributed in [0.0, 1.0).

     .. note:: The function is non-deterministic in general case.
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 1a0244f..8d8638d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1227,7 +1227,7 @@ object functions {
   /**
    * Generate a random column with independent and identically distributed (i.i.d.) samples
-   * from U[0.0, 1.0].
+   * uniformly distributed in [0.0, 1.0).
    *
    * @note The function is non-deterministic in general case.
    *
@@ -1238,7 +1238,7 @@ object functions {
   /**
    * Generate a random column with independent and identically distributed (i.i.d.) samples
-   * from U[0.0, 1.0].
+   * uniformly distributed in [0.0, 1.0).
    *
    * @note The function is non-deterministic in general case.
    *
[spark] branch branch-3.0 updated: [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 1caca7d  [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
1caca7d is described below

commit 1caca7d97a03ab9ac99597e1ef9fa3890da90743
Author: Ben Ryves
AuthorDate: Tue Mar 31 15:16:17 2020 +0900

    [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound

    ### What changes were proposed in this pull request?
    A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`.

    ### Why are the changes needed?
    `rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](https://github.com/apache/spark/blob/a1dbcd13a3eeaee50cc1a46e909f9478d6d55177/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala#L71)). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0).

    ### Does this PR introduce any user-facing change?
    Only documentation changes.

    ### How was this patch tested?
    Documentation changes only.

    Closes #28071 from Smeb/master.

    Authored-by: Ben Ryves
    Signed-off-by: HyukjinKwon
    (cherry picked from commit fa378567105ec9d9bbe30edf4b74b09c3df27658)
    Signed-off-by: HyukjinKwon
---
 R/pkg/R/functions.R                                          | 2 +-
 python/pyspark/sql/functions.py                              | 2 +-
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index d8b0450..173dbc4 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -2888,7 +2888,7 @@ setMethod("lpad", signature(x = "Column", len = "numeric", pad = "character"),
 #' @details
 #' \code{rand}: Generates a random column with independent and identically distributed (i.i.d.)
-#' samples from U[0.0, 1.0].
+#' samples uniformly distributed in [0.0, 1.0).
 #' Note: the function is non-deterministic in general case.
 #'
 #' @rdname column_nonaggregate_functions
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 1ade21c..476aab4 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -599,7 +599,7 @@ def nanvl(col1, col2):
 @since(1.4)
 def rand(seed=None):
     """Generates a random column with independent and identically distributed (i.i.d.) samples
-    from U[0.0, 1.0].
+    uniformly distributed in [0.0, 1.0).

     .. note:: The function is non-deterministic in general case.
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 8a89a3b..fd4e77f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1204,7 +1204,7 @@ object functions {
   /**
    * Generate a random column with independent and identically distributed (i.i.d.) samples
-   * from U[0.0, 1.0].
+   * uniformly distributed in [0.0, 1.0).
    *
    * @note The function is non-deterministic in general case.
    *
@@ -1215,7 +1215,7 @@ object functions {
   /**
    * Generate a random column with independent and identically distributed (i.i.d.) samples
-   * from U[0.0, 1.0].
+   * uniformly distributed in [0.0, 1.0).
    *
    * @note The function is non-deterministic in general case.
    *