[spark] branch master updated: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3c7ef7d6135 [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`
3c7ef7d6135 is described below

commit 3c7ef7d6135a33448e9b08902f4b5582ae2d60c4
Author: Ruifeng Zheng
AuthorDate: Thu Mar 16 15:03:56 2023 +0800

[SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

### What changes were proposed in this pull request?
Implement ml function `{array_to_vector, vector_to_array}`

### Why are the changes needed?
function parity

### Does this PR introduce _any_ user-facing change?
yes, new functions

### How was this patch tested?
added ut and manually check

```
(spark_dev) ➜  spark git:(connect_ml_functions) ✗ bin/pyspark --remote "local[*]"
Python 3.9.16 (main, Mar  8 2023, 04:29:24)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.11.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/15 11:56:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/

Using Python version 3.9.16 (main, Mar  8 2023 04:29:24)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.

In [1]:

In [1]: query = """
   ...: SELECT * FROM VALUES
   ...: (1, 4, ARRAY(1.0, 2.0, 3.0)),
   ...: (1, 2, ARRAY(-1.0, -2.0, -3.0))
   ...: AS tab(a, b, c)
   ...: """

In [2]: cdf = spark.sql(query)

In [3]: from pyspark.sql.connect.ml import functions as CF

In [4]: cdf1 = cdf.select("a", CF.array_to_vector(cdf.c).alias("d"))

In [5]: cdf1.show()
+---+----------------+
|  a|               d|
+---+----------------+
|  1|   [1.0,2.0,3.0]|
|  1|[-1.0,-2.0,-3.0]|
+---+----------------+

In [6]: cdf1.schema
Out[6]: StructType([StructField('a', IntegerType(), False), StructField('d', VectorUDT(), True)])

In [7]: cdf1.select(CF.vector_to_array(cdf1.d))
Out[7]: DataFrame[UDF(d): array<double>]

In [8]: cdf1.select(CF.vector_to_array(cdf1.d)).show()
+------------------+
|            UDF(d)|
+------------------+
|   [1.0, 2.0, 3.0]|
|[-1.0, -2.0, -3.0]|
+------------------+

In [9]: cdf1.select(CF.vector_to_array(cdf1.d)).schema
Out[9]: StructType([StructField('UDF(d)', ArrayType(DoubleType(), False), False)])
```

Closes #40432 from zhengruifeng/connect_ml_functions.
Authored-by: Ruifeng Zheng
Signed-off-by: Ruifeng Zheng
---
 .../sql/connect/planner/SparkConnectPlanner.scala  |  42
 dev/sparktestsupport/modules.py                    |  92 +
 .../main/scala/org/apache/spark/ml/functions.scala |   6 +-
 python/pyspark/ml/connect/__init__.py              |  18
 python/pyspark/ml/connect/functions.py             |  76 ++
 python/pyspark/ml/functions.py                     |   9 ++
 .../ml/tests/connect/test_connect_function.py      | 113 +
 python/pyspark/ml/util.py                          |  36 ++-
 8 files changed, 346 insertions(+), 46 deletions(-)

diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index a057bd8d6c1..20db252c057 100644
--- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -30,6 +30,7 @@ import org.apache.spark.connect.proto
 import org.apache.spark.connect.proto.{ExecutePlanResponse, SqlCommand}
 import org.apache.spark.connect.proto.ExecutePlanResponse.SqlCommandResult
 import org.apache.spark.connect.proto.Parse.ParseFormat
+import org.apache.spark.ml.{functions => MLFunctions}
 import org.apache.spark.sql.{Column, Dataset, Encoders, SparkSession}
 import org.apache.spark.sql.catalyst.{expressions, AliasIdentifier, FunctionIdentifier}
 import org.apache.spark.sql.catalyst.analysis.{GlobalTempView, LocalTempView, MultiAlias, ParameterizedQuery
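For reference, the new functions can also be exercised outside the interactive shell. The sketch below is a hedged, minimal example built from the transcript above; it assumes a Spark Connect server reachable at `sc://localhost`, and the import path follows the transcript (the diffstat also adds `python/pyspark/ml/connect/functions.py`, so the final module location may differ by build).

```python
# Minimal sketch (assumption: a Spark Connect server is running at sc://localhost).
from pyspark.sql import SparkSession
from pyspark.sql.connect.ml import functions as CF  # import path as shown in the transcript above

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = spark.sql("""
    SELECT * FROM VALUES
      (1, 4, ARRAY(1.0, 2.0, 3.0)),
      (1, 2, ARRAY(-1.0, -2.0, -3.0))
    AS tab(a, b, c)
""")

# array<double> -> VectorUDT, then back to array<double>
vec_df = df.select("a", CF.array_to_vector(df.c).alias("d"))
vec_df.select(CF.vector_to_array(vec_df.d)).show()
```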
[spark] branch master updated: [SPARK-41233][SQL][PYTHON] Add `array_prepend` function
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3dd629629ab [SPARK-41233][SQL][PYTHON] Add `array_prepend` function 3dd629629ab is described below commit 3dd629629ab151688b82a3aa66e1b5fa568afbfa Author: Navin Viswanath AuthorDate: Thu Mar 16 17:51:33 2023 +0800 [SPARK-41233][SQL][PYTHON] Add `array_prepend` function ### What changes were proposed in this pull request? Adds a new array function array_prepend to catalyst. ### Why are the changes needed? This adds a function that exists in many SQL implementations, specifically Snowflake: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added unit tests. Closes #38947 from navinvishy/array-prepend. Lead-authored-by: Navin Viswanath Co-authored-by: navinvishy Signed-off-by: Ruifeng Zheng --- .../source/reference/pyspark.sql/functions.rst | 1 + python/pyspark/sql/functions.py| 30 + .../sql/catalyst/analysis/FunctionRegistry.scala | 1 + .../expressions/collectionOperations.scala | 146 + .../expressions/CollectionExpressionsSuite.scala | 44 +++ .../scala/org/apache/spark/sql/functions.scala | 10 ++ .../sql-functions/sql-expression-schema.md | 3 +- .../src/test/resources/sql-tests/inputs/array.sql | 11 ++ .../resources/sql-tests/results/ansi/array.sql.out | 72 ++ .../test/resources/sql-tests/results/array.sql.out | 72 ++ .../apache/spark/sql/DataFrameFunctionsSuite.scala | 68 ++ 11 files changed, 457 insertions(+), 1 deletion(-) diff --git a/python/docs/source/reference/pyspark.sql/functions.rst b/python/docs/source/reference/pyspark.sql/functions.rst index 70fc04ef9cf..cbc46e1fae1 100644 --- a/python/docs/source/reference/pyspark.sql/functions.rst +++ b/python/docs/source/reference/pyspark.sql/functions.rst @@ -159,6 +159,7 @@ Collection Functions array_sort array_insert array_remove +array_prepend array_distinct array_intersect array_union diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 051fd52a13c..1f02be3ad21 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -7631,6 +7631,36 @@ def get(col: "ColumnOrName", index: Union["ColumnOrName", int]) -> Column: return _invoke_function_over_columns("get", col, index) +@try_remote_functions +def array_prepend(col: "ColumnOrName", value: Any) -> Column: +""" +Collection function: Returns an array containing element as +well as all elements from array. The new element is positioned +at the beginning of the array. + +.. versionadded:: 3.5.0 + +Parameters +-- +col : :class:`~pyspark.sql.Column` or str +name of column containing array +value : +a literal value, or a :class:`~pyspark.sql.Column` expression. + +Returns +--- +:class:`~pyspark.sql.Column` +an array excluding given value. 
+ +Examples + +>>> df = spark.createDataFrame([([2, 3, 4],), ([],)], ['data']) +>>> df.select(array_prepend(df.data, 1)).collect() +[Row(array_prepend(data, 1)=[1, 2, 3, 4]), Row(array_prepend(data, 1)=[1])] +""" +return _invoke_function_over_columns("array_prepend", col, lit(value)) + + @try_remote_functions def array_remove(col: "ColumnOrName", element: Any) -> Column: """ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index ad82a836199..aca73741c63 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -697,6 +697,7 @@ object FunctionRegistry { expression[Sequence]("sequence"), expression[ArrayRepeat]("array_repeat"), expression[ArrayRemove]("array_remove"), +expression[ArrayPrepend]("array_prepend"), expression[ArrayDistinct]("array_distinct"), expression[ArrayTransform]("transform"), expression[MapFilter]("map_filter"), diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index 289859d420b..2ccb3a6d0cd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala
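For reference, a minimal sketch of the proposed API, lifted from the docstring example in the diff above. Note that a later commit in this digest reverts SPARK-41233, so the function is only available on builds that actually contain it.

```python
# Sketch based on the docstring example above; requires a build containing SPARK-41233.
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_prepend

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([2, 3, 4],), ([],)], ["data"])

# The new element is positioned at the beginning of each array.
df.select(array_prepend(df.data, 1)).collect()
# [Row(array_prepend(data, 1)=[1, 2, 3, 4]), Row(array_prepend(data, 1)=[1])]
```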
[spark] branch master updated: [SPARK-42819][SS] Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bafb953f32e [SPARK-42819][SS] Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming bafb953f32e is described below commit bafb953f32e4f7c04f7f163354ec997bae8aa0e6 Author: Anish Shrigondekar AuthorDate: Thu Mar 16 19:52:16 2023 +0900 [SPARK-42819][SS] Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming ### What changes were proposed in this pull request? Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming ### Why are the changes needed? We need these settings in order to control memory tuning for RocksDB. We already expose settings for blockCache size. However, these 2 settings are missing. This change proposes to add them. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests and docs in the guide doc RocksDBSuite ``` [info] Run completed in 59 seconds, 336 milliseconds. [info] Total number of tests run: 27 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 165 s (02:45), completed Mar 15, 2023, 11:24:17 PM ``` RocksDBStateStoreSuite ``` [info] Run completed in 1 minute, 16 seconds. [info] Total number of tests run: 73 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 73, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` Closes #40455 from anishshri-db/task/SPARK-42819. Authored-by: Anish Shrigondekar Signed-off-by: Jungtaek Lim --- docs/structured-streaming-programming-guide.md | 10 ++ .../sql/execution/streaming/state/RocksDB.scala| 41 -- .../streaming/state/RocksDBStateStoreSuite.scala | 4 +++ .../execution/streaming/state/RocksDBSuite.scala | 27 ++ 4 files changed, 79 insertions(+), 3 deletions(-) diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index a71c774f328..486bed7184f 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -2289,6 +2289,16 @@ Here are the configs regarding to RocksDB instance of the state store provider: Whether we track the total number of rows in state store. Please refer the details in Performance-aspect considerations. True + +spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB +The maximum size of MemTable in RocksDB. Value of -1 means that RocksDB internal default values will be used +-1 + + +spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber +The maximum number of MemTables in RocksDB, both active and immutable. 
Value of -1 means that RocksDB internal default values will be used +-1 + # Performance-aspect considerations diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala index 89872afb80e..363cc2b5c46 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala @@ -72,7 +72,21 @@ class RocksDB( tableFormatConfig.setFilterPolicy(bloomFilter) tableFormatConfig.setFormatVersion(conf.formatVersion) - private val dbOptions = new Options() // options to open the RocksDB + private val columnFamilyOptions = new ColumnFamilyOptions() + + private val dbOptions = +new Options(new DBOptions(), columnFamilyOptions) // options to open the RocksDB + + // Set RocksDB options around MemTable memory usage. By default, we let RocksDB + // use its internal default values for these settings. + if (conf.writeBufferSizeMB > 0L) { +columnFamilyOptions.setWriteBufferSize(conf.writeBufferSizeMB * 1024 * 1024) + } + + if (conf.maxWriteBufferNumber > 0L) { +columnFamilyOptions.setMaxWriteBufferNumber(conf.maxWriteBufferNumber) + } + dbOptions.setCreateIfMissing(true) dbOptions.setTableFormatConfig(tableFormatConfig) dbOptions.setMaxOpenFiles(conf.maxOpenFiles) @@ -558,7 +572,9 @@ case class RocksDBConf( resetStatsOnLoad : Boolean, formatVersion: Int, trackTotalNumberOfRows: Boolean, -maxOpenFiles: Int) +maxOpenFiles: Int, +writeBufferSizeMB: Long, +maxWriteBufferNumber: Int) object RocksDBConf { /** Common prefix of all confs in SQLConf that affects RocksDB */ @@ -609,6 +625,15 @@ object Ro
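For reference, a minimal sketch of how the two new settings could be applied to a stateful streaming query. The RocksDB provider class name and the rate source are assumptions used purely for illustration; leaving either setting at -1 (the default) keeps RocksDB's internal values.

```python
# Hedged sketch: RocksDB-backed state store with the new MemTable settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    .config("spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")   # max MemTable size
    .config("spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber", "4") # active + immutable MemTables
    .getOrCreate()
)

# A stateful aggregation so the state store is actually exercised.
counts = spark.readStream.format("rate").load().groupBy("value").count()
query = counts.writeStream.format("console").outputMode("complete").start()
```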
[spark] branch master updated: Revert "[SPARK-41233][SQL][PYTHON] Add `array_prepend` function"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new baf90206d04 Revert "[SPARK-41233][SQL][PYTHON] Add `array_prepend` function" baf90206d04 is described below commit baf90206d04738e63ea71f63d86668a7dc7c8f9a Author: Hyukjin Kwon AuthorDate: Thu Mar 16 20:35:47 2023 +0900 Revert "[SPARK-41233][SQL][PYTHON] Add `array_prepend` function" This reverts commit 3dd629629ab151688b82a3aa66e1b5fa568afbfa. --- .../source/reference/pyspark.sql/functions.rst | 1 - python/pyspark/sql/functions.py| 30 - .../sql/catalyst/analysis/FunctionRegistry.scala | 1 - .../expressions/collectionOperations.scala | 146 - .../expressions/CollectionExpressionsSuite.scala | 44 --- .../scala/org/apache/spark/sql/functions.scala | 10 -- .../sql-functions/sql-expression-schema.md | 3 +- .../src/test/resources/sql-tests/inputs/array.sql | 11 -- .../resources/sql-tests/results/ansi/array.sql.out | 72 -- .../test/resources/sql-tests/results/array.sql.out | 72 -- .../apache/spark/sql/DataFrameFunctionsSuite.scala | 68 -- 11 files changed, 1 insertion(+), 457 deletions(-) diff --git a/python/docs/source/reference/pyspark.sql/functions.rst b/python/docs/source/reference/pyspark.sql/functions.rst index cbc46e1fae1..70fc04ef9cf 100644 --- a/python/docs/source/reference/pyspark.sql/functions.rst +++ b/python/docs/source/reference/pyspark.sql/functions.rst @@ -159,7 +159,6 @@ Collection Functions array_sort array_insert array_remove -array_prepend array_distinct array_intersect array_union diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 1f02be3ad21..051fd52a13c 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -7631,36 +7631,6 @@ def get(col: "ColumnOrName", index: Union["ColumnOrName", int]) -> Column: return _invoke_function_over_columns("get", col, index) -@try_remote_functions -def array_prepend(col: "ColumnOrName", value: Any) -> Column: -""" -Collection function: Returns an array containing element as -well as all elements from array. The new element is positioned -at the beginning of the array. - -.. versionadded:: 3.5.0 - -Parameters --- -col : :class:`~pyspark.sql.Column` or str -name of column containing array -value : -a literal value, or a :class:`~pyspark.sql.Column` expression. - -Returns ---- -:class:`~pyspark.sql.Column` -an array excluding given value. 
- -Examples - ->>> df = spark.createDataFrame([([2, 3, 4],), ([],)], ['data']) ->>> df.select(array_prepend(df.data, 1)).collect() -[Row(array_prepend(data, 1)=[1, 2, 3, 4]), Row(array_prepend(data, 1)=[1])] -""" -return _invoke_function_over_columns("array_prepend", col, lit(value)) - - @try_remote_functions def array_remove(col: "ColumnOrName", element: Any) -> Column: """ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index aca73741c63..ad82a836199 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -697,7 +697,6 @@ object FunctionRegistry { expression[Sequence]("sequence"), expression[ArrayRepeat]("array_repeat"), expression[ArrayRemove]("array_remove"), -expression[ArrayPrepend]("array_prepend"), expression[ArrayDistinct]("array_distinct"), expression[ArrayTransform]("transform"), expression[MapFilter]("map_filter"), diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index 2ccb3a6d0cd..289859d420b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -1399,152 +1399,6 @@ case class ArrayContains(left: Expression, right: Expression) copy(left = newLeft, right = newRight) } -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = """ - _FUNC_(array, element) - Add the element at the beginning of the array passed as first - argument. Type of element should be the same as the type of the elements of the array. - Null element is also prepended to the array. But if the array passed is NULL - output is NULL -""
[spark] branch master updated: [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f025d5eb1c2 [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster f025d5eb1c2 is described below commit f025d5eb1c2c9a6f7933679aa80752e806df9d2a Author: Chandni Singh AuthorDate: Thu Mar 16 14:27:31 2023 -0700 [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster ### What changes were proposed in this pull request? Removed the logging of shuffle service name multiple times in the driver log. It gets logged everytime a new executor is allocated. ### Why are the changes needed? This is needed because currently the driver logs gets polluted by these logs: ``` 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' ``` ### Does this PR introduce _any_ user-facing change? Yes, the shuffle service name will be just logged once in the driver. ### How was this patch tested? Tested manually since it just changes the logging. With this see this logged in the driver logs: `23/03/15 16:50:54 INFO ApplicationMaster: Initializing service data for shuffle service using name 'spark_shuffle_311'` Closes #40448 from otterc/SPARK-42817. 
Authored-by: Chandni Singh Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala | 5 - .../main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala | 1 - 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala index 252c84a1cd4..8bf31a9286e 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala @@ -498,7 +498,10 @@ private[spark] class ApplicationMaster( // that when the driver sends an initial executor request (e.g. after an AM restart), // the allocator is ready to service requests. rpcEnv.setupEndpoint("YarnAM", new AMEndpoint(rpcEnv, driverRef)) - +if (_sparkConf.get(SHUFFLE_SERVICE_ENABLED)) { + logInfo("Initializing service data for shuffle service using name '" + +s"${_sparkConf.get(SHUFFLE_SERVICE_NAME)}'") +} allocator.allocateResources() val ms = MetricsSystem.createMetricsSystem(MetricsSystemInstances.APPLICATION_MASTER, sparkConf) val prefix = _sparkConf.get(YARN_METRICS_NAMESPACE).getOrElse(appId) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala index 0148b6f3c95..1f3121ed224 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala @@ -115,7 +115,6 @@ private[yarn] class ExecutorRunnable( ByteBuffer.allocate(0) } val serviceName = sparkConf.get(SHUFFLE_SERVICE_NAME) - logInfo(s"Initializing service
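For context, the log line in question is emitted only when the external shuffle service is enabled; on YARN the service name itself is configurable. A hedged sketch of the relevant settings follows; the config keys are standard Spark settings, and the name value is simply the one appearing in the logs above.

```python
# Sketch of the configuration that triggers the (now once-only) AM log line.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.service.name", "spark_shuffle_311")  # name echoed in the driver log
    .getOrCreate()
)
```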
[spark] branch branch-3.4 updated: [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 62d6a3b998a [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster 62d6a3b998a is described below commit 62d6a3b998a1ad341e4d0870e0cc00d46be96c6e Author: Chandni Singh AuthorDate: Thu Mar 16 14:27:31 2023 -0700 [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster ### What changes were proposed in this pull request? Removed the logging of shuffle service name multiple times in the driver log. It gets logged everytime a new executor is allocated. ### Why are the changes needed? This is needed because currently the driver logs gets polluted by these logs: ``` 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' ``` ### Does this PR introduce _any_ user-facing change? Yes, the shuffle service name will be just logged once in the driver. ### How was this patch tested? Tested manually since it just changes the logging. With this see this logged in the driver logs: `23/03/15 16:50:54 INFO ApplicationMaster: Initializing service data for shuffle service using name 'spark_shuffle_311'` Closes #40448 from otterc/SPARK-42817. 
Authored-by: Chandni Singh Signed-off-by: Dongjoon Hyun (cherry picked from commit f025d5eb1c2c9a6f7933679aa80752e806df9d2a) Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala | 5 - .../main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala | 1 - 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala index 9815fa6df8a..73deaf7a028 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala @@ -498,7 +498,10 @@ private[spark] class ApplicationMaster( // that when the driver sends an initial executor request (e.g. after an AM restart), // the allocator is ready to service requests. rpcEnv.setupEndpoint("YarnAM", new AMEndpoint(rpcEnv, driverRef)) - +if (_sparkConf.get(SHUFFLE_SERVICE_ENABLED)) { + logInfo("Initializing service data for shuffle service using name '" + +s"${_sparkConf.get(SHUFFLE_SERVICE_NAME)}'") +} allocator.allocateResources() val ms = MetricsSystem.createMetricsSystem(MetricsSystemInstances.APPLICATION_MASTER, sparkConf) val prefix = _sparkConf.get(YARN_METRICS_NAMESPACE).getOrElse(appId) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala index 0148b6f3c95..1f3121ed224 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala @@ -115,7 +115,6 @@ private[yarn] class ExecutorRunnable( ByteBuffer.allocat
[spark] branch master updated: [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1b40565f99b [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version 1b40565f99b is described below commit 1b40565f99b1a3888be46cfdf673b60198bf47a2 Author: itholic AuthorDate: Fri Mar 17 09:43:10 2023 +0900 [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version ### What changes were proposed in this pull request? This PR proposes to add a migration note for update to supported pandas version. ### Why are the changes needed? Some APIs have been deprecated or removed from SPARK-42593 to follow pandas 2.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review is required. Closes #40459 from itholic/SPARK-42826. Lead-authored-by: itholic Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon --- python/docs/source/migration_guide/pyspark_upgrade.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst index 0a590548684..d06475f9b36 100644 --- a/python/docs/source/migration_guide/pyspark_upgrade.rst +++ b/python/docs/source/migration_guide/pyspark_upgrade.rst @@ -33,6 +33,7 @@ Upgrading from PySpark 3.3 to 3.4 * In Spark 3.4, the ``Series.concat`` sort parameter will be respected to follow pandas 1.4 behaviors. * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace pre-existing arrays, which will NOT be over-written to follow pandas 1.4 behaviors. * In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` have got new parameter ``args`` which provides binding of named parameters to their SQL literals. +* In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs were deprecated or removed in Spark 3.4 according to the changes made in pandas 2.0. Please refer to the [release notes of pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details. Upgrading from PySpark 3.2 to 3.3 @@ -108,4 +109,4 @@ Upgrading from PySpark 1.4 to 1.5 Upgrading from PySpark 1.0-1.2 to 1.3 - -* When using DataTypes in Python you will need to construct them (i.e. ``StringType()``) instead of referencing a singleton. \ No newline at end of file +* When using DataTypes in Python you will need to construct them (i.e. ``StringType()``) instead of referencing a singleton. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.4 updated: [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 833599c6e27 [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version 833599c6e27 is described below commit 833599c6e27c98556d596c7d938c432e039e85ba Author: itholic AuthorDate: Fri Mar 17 09:43:10 2023 +0900 [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version ### What changes were proposed in this pull request? This PR proposes to add a migration note for update to supported pandas version. ### Why are the changes needed? Some APIs have been deprecated or removed from SPARK-42593 to follow pandas 2.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review is required. Closes #40459 from itholic/SPARK-42826. Lead-authored-by: itholic Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon (cherry picked from commit 1b40565f99b1a3888be46cfdf673b60198bf47a2) Signed-off-by: Hyukjin Kwon --- python/docs/source/migration_guide/pyspark_upgrade.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst index 0a590548684..d06475f9b36 100644 --- a/python/docs/source/migration_guide/pyspark_upgrade.rst +++ b/python/docs/source/migration_guide/pyspark_upgrade.rst @@ -33,6 +33,7 @@ Upgrading from PySpark 3.3 to 3.4 * In Spark 3.4, the ``Series.concat`` sort parameter will be respected to follow pandas 1.4 behaviors. * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace pre-existing arrays, which will NOT be over-written to follow pandas 1.4 behaviors. * In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` have got new parameter ``args`` which provides binding of named parameters to their SQL literals. +* In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs were deprecated or removed in Spark 3.4 according to the changes made in pandas 2.0. Please refer to the [release notes of pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details. Upgrading from PySpark 3.2 to 3.3 @@ -108,4 +109,4 @@ Upgrading from PySpark 1.4 to 1.5 Upgrading from PySpark 1.0-1.2 to 1.3 - -* When using DataTypes in Python you will need to construct them (i.e. ``StringType()``) instead of referencing a singleton. \ No newline at end of file +* When using DataTypes in Python you will need to construct them (i.e. ``StringType()``) instead of referencing a singleton. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new deac4813044 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes deac4813044 is described below commit deac481304489f9b8ecd24ec6f3aed1e0c0d75eb Author: itholic AuthorDate: Fri Mar 17 11:13:00 2023 +0900 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes ### What changes were proposed in this pull request? This pull request proposes an improvement to the error message when trying to access a JVM attribute that is not supported in Spark Connect. Specifically, it adds a more informative error message that clearly indicates which attribute is not supported due to Spark Connect's lack of dependency on the JVM. ### Why are the changes needed? Currently, when attempting to access an unsupported JVM attribute in Spark Connect, the error message is not very clear, making it difficult for users to understand the root cause of the issue. This improvement aims to provide more helpful information to users to address this problem as below: **Before** ```python >>> spark._jsc Traceback (most recent call last): File "", line 1, in AttributeError: 'SparkSession' object has no attribute '_jsc' ``` **After** ```python >>> spark._jsc Traceback (most recent call last): File "", line 1, in File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/session.py", line 490, in _jsc raise PySparkAttributeError( pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsc` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, use the original PySpark instead of Spark Connect. ``` ### Does this PR introduce _any_ user-facing change? This PR does not introduce any user-facing change in terms of functionality. However, it improves the error message, which could potentially affect the user experience in a positive way. ### How was this patch tested? This patch was tested by adding new unit tests that specifically target the error message related to unsupported JVM attributes. The tests were run locally on a development environment. Closes #40458 from itholic/SPARK-42824. 
Authored-by: itholic Signed-off-by: Hyukjin Kwon --- python/pyspark/errors/__init__.py | 2 + python/pyspark/errors/error_classes.py | 5 +++ python/pyspark/errors/exceptions/base.py | 6 +++ python/pyspark/sql/connect/column.py | 12 +- python/pyspark/sql/connect/dataframe.py| 6 ++- python/pyspark/sql/connect/readwriter.py | 7 python/pyspark/sql/connect/session.py | 26 .../sql/tests/connect/test_connect_basic.py| 49 +- 8 files changed, 110 insertions(+), 3 deletions(-) diff --git a/python/pyspark/errors/__init__.py b/python/pyspark/errors/__init__.py index 95da7ca2aa8..94117fc5160 100644 --- a/python/pyspark/errors/__init__.py +++ b/python/pyspark/errors/__init__.py @@ -31,6 +31,7 @@ from pyspark.errors.exceptions.base import ( # noqa: F401 SparkUpgradeException, PySparkTypeError, PySparkValueError, +PySparkAttributeError, ) @@ -47,4 +48,5 @@ __all__ = [ "SparkUpgradeException", "PySparkTypeError", "PySparkValueError", +"PySparkAttributeError", ] diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py index 8c0f79f7d5a..dda1f5a1f84 100644 --- a/python/pyspark/errors/error_classes.py +++ b/python/pyspark/errors/error_classes.py @@ -39,6 +39,11 @@ ERROR_CLASSES_JSON = """ "Function `` should return Column, got ." ] }, + "JVM_ATTRIBUTE_NOT_SUPPORTED" : { +"message" : [ + "Attribute `` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session." +] + }, "NOT_BOOL" : { "message" : [ "Argument `` should be a bool, got ." diff --git a/python/pyspark/errors/exceptions/base.py b/python/pyspark/errors/exceptions/base.py index 6e67039374d..fa66b80ac3a 100644 --- a/python/pyspark/errors/exceptions/base.py +++ b/python/pyspark/errors/exceptions/base.py @@ -160,3 +160,9 @@ class PySparkTypeError(PySparkException, TypeError): """ Wrapper class for TypeError to support error classes. """ + + +class PySparkAttributeError(PySparkException, AttributeError): +""" +Wrapper class for AttributeError to support error classes. +""" diff --git a/python/pyspark/sql/connect/column.py
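For reference, a minimal sketch of how client code that must run both with and without Spark Connect could react to the new error class. Since `PySparkAttributeError` subclasses `AttributeError` (per the diff), catching it is an additive change; the fallback branch is an assumption for illustration only.

```python
# Hedged sketch: degrade gracefully when a JVM-backed attribute is unavailable.
from pyspark.sql import SparkSession
from pyspark.errors import PySparkAttributeError

spark = SparkSession.builder.getOrCreate()

try:
    jsc = spark._jsc  # JVM-backed attribute; unsupported on Spark Connect
except PySparkAttributeError:
    jsc = None        # assumed fallback: take a JVM-free code path instead
```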
[spark] branch branch-3.4 updated: [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new ca75340d607 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes ca75340d607 is described below commit ca75340d607f13932c9a49081ae9effcee5ac3f7 Author: itholic AuthorDate: Fri Mar 17 11:13:00 2023 +0900 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes ### What changes were proposed in this pull request? This pull request proposes an improvement to the error message when trying to access a JVM attribute that is not supported in Spark Connect. Specifically, it adds a more informative error message that clearly indicates which attribute is not supported due to Spark Connect's lack of dependency on the JVM. ### Why are the changes needed? Currently, when attempting to access an unsupported JVM attribute in Spark Connect, the error message is not very clear, making it difficult for users to understand the root cause of the issue. This improvement aims to provide more helpful information to users to address this problem as below: **Before** ```python >>> spark._jsc Traceback (most recent call last): File "", line 1, in AttributeError: 'SparkSession' object has no attribute '_jsc' ``` **After** ```python >>> spark._jsc Traceback (most recent call last): File "", line 1, in File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/session.py", line 490, in _jsc raise PySparkAttributeError( pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsc` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, use the original PySpark instead of Spark Connect. ``` ### Does this PR introduce _any_ user-facing change? This PR does not introduce any user-facing change in terms of functionality. However, it improves the error message, which could potentially affect the user experience in a positive way. ### How was this patch tested? This patch was tested by adding new unit tests that specifically target the error message related to unsupported JVM attributes. The tests were run locally on a development environment. Closes #40458 from itholic/SPARK-42824. 
Authored-by: itholic Signed-off-by: Hyukjin Kwon (cherry picked from commit deac481304489f9b8ecd24ec6f3aed1e0c0d75eb) Signed-off-by: Hyukjin Kwon --- python/pyspark/errors/__init__.py | 2 + python/pyspark/errors/error_classes.py | 5 +++ python/pyspark/errors/exceptions/base.py | 6 +++ python/pyspark/sql/connect/column.py | 12 +- python/pyspark/sql/connect/dataframe.py| 6 ++- python/pyspark/sql/connect/readwriter.py | 7 python/pyspark/sql/connect/session.py | 26 .../sql/tests/connect/test_connect_basic.py| 49 +- 8 files changed, 110 insertions(+), 3 deletions(-) diff --git a/python/pyspark/errors/__init__.py b/python/pyspark/errors/__init__.py index 95da7ca2aa8..94117fc5160 100644 --- a/python/pyspark/errors/__init__.py +++ b/python/pyspark/errors/__init__.py @@ -31,6 +31,7 @@ from pyspark.errors.exceptions.base import ( # noqa: F401 SparkUpgradeException, PySparkTypeError, PySparkValueError, +PySparkAttributeError, ) @@ -47,4 +48,5 @@ __all__ = [ "SparkUpgradeException", "PySparkTypeError", "PySparkValueError", +"PySparkAttributeError", ] diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py index 8c0f79f7d5a..dda1f5a1f84 100644 --- a/python/pyspark/errors/error_classes.py +++ b/python/pyspark/errors/error_classes.py @@ -39,6 +39,11 @@ ERROR_CLASSES_JSON = """ "Function `` should return Column, got ." ] }, + "JVM_ATTRIBUTE_NOT_SUPPORTED" : { +"message" : [ + "Attribute `` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session." +] + }, "NOT_BOOL" : { "message" : [ "Argument `` should be a bool, got ." diff --git a/python/pyspark/errors/exceptions/base.py b/python/pyspark/errors/exceptions/base.py index 6e67039374d..fa66b80ac3a 100644 --- a/python/pyspark/errors/exceptions/base.py +++ b/python/pyspark/errors/exceptions/base.py @@ -160,3 +160,9 @@ class PySparkTypeError(PySparkException, TypeError): """ Wrapper class for TypeError to support error classes. """ + + +class PySparkAttributeError(PySparkException, AttributeError): +""" +Wrapp
[spark] branch master updated: [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2000d5f8db8 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization 2000d5f8db8 is described below commit 2000d5f8db838db62967a45d574728a8bf2aaf6b Author: Kent Yao AuthorDate: Thu Mar 16 20:29:16 2023 -0700 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization ### What changes were proposed in this pull request? Currently, we only support initializing spark-sql shell with a single-part schema, which also must be forced to the session catalog. case 1, specifying catalog field for v1sessioncatalog ```sql bin/spark-sql --database spark_catalog.default Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spark_catalog.default' not found ``` case 2, setting the default catalog to another one ```sql bin/spark-sql -c spark.sql.defaultCatalog=testcat -c spark.sql.catalog.testcat=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog -c spark.sql.catalog.testcat.url='jdbc:derby:memory:testcat;create=true' -c spark.sql.catalog.testcat.driver=org.apache.derby.jdbc.AutoloadedDriver -c spark.sql.catalogImplementation=in-memory --database SYS 23/03/16 18:40:49 WARN ObjectStore: Failed to get database sys, returning NoSuchObjectException Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'sys' not found ``` In this PR, we switch to use-statement to support multipart namespaces, which helps us resovle to catalog correctly. ### Why are the changes needed? Make spark-sql shell better support the v2 catalog framework. ### Does this PR introduce _any_ user-facing change? Yes, `--database` option supports multipart namespaces and works for v2 catalogs now. And you will see this behavior on spark web ui. ### How was this patch tested? new ut Closes #40457 from yaooqinn/SPARK-42823. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 15 ++--- .../spark/sql/hive/thriftserver/CliSuite.scala | 26 ++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index 51b314ad2c1..22df4e67440 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -201,14 +201,6 @@ private[hive] object SparkSQLCLIDriver extends Logging { case e: UnsupportedEncodingException => exit(ERROR_PATH_NOT_FOUND) } -if (sessionState.database != null) { - SparkSQLEnv.sqlContext.sessionState.catalog.setCurrentDatabase( -s"${sessionState.database}") -} - -// Execute -i init files (always in silent mode) -cli.processInitFiles(sessionState) - // We don't propagate hive.metastore.warehouse.dir, because it might has been adjusted in // [[SharedState.loadHiveConfFile]] based on the user specified or default values of // spark.sql.warehouse.dir and hive.metastore.warehouse.dir. 
@@ -216,6 +208,13 @@ private[hive] object SparkSQLCLIDriver extends Logging { SparkSQLEnv.sqlContext.setConf(k, v) } +if (sessionState.database != null) { + SparkSQLEnv.sqlContext.sql(s"USE ${sessionState.database}") +} + +// Execute -i init files (always in silent mode) +cli.processInitFiles(sessionState) + cli.printMasterAndAppId if (sessionState.execString != null) { diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala index 5413635ba47..651c6b7aafb 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala @@ -35,6 +35,7 @@ import org.apache.hadoop.hive.ql.session.SessionState import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkContext, SparkFunSuite} import org.apache.spark.ProcessTestUtils.ProcessOutputCapturer import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog import org.apache.spark.sql.hive.HiveUtils import org.apache.spark.sql.hive.Hive
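For reference, a roughly equivalent session-level sketch of what the shell now does: routing `--database` through a `USE` statement so multipart namespaces resolve via the v2 catalog framework. The catalog settings are reused from the example above and assume the Derby driver is on the classpath.

```python
# Hedged sketch of multipart namespace resolution with a v2 JDBC catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.testcat",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.testcat.url", "jdbc:derby:memory:testcat;create=true")
    .config("spark.sql.catalog.testcat.driver", "org.apache.derby.jdbc.AutoloadedDriver")
    .getOrCreate()
)

spark.sql("USE testcat.SYS")  # multipart namespace, resolved in the v2 catalog
spark.sql("SELECT current_catalog(), current_database()").show()
```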
[spark] branch branch-3.4 updated: [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new c29cf34bfc6 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization c29cf34bfc6 is described below commit c29cf34bfc694cd3d959c82a25adf251975f0817 Author: Kent Yao AuthorDate: Thu Mar 16 20:29:16 2023 -0700 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization ### What changes were proposed in this pull request? Currently, we only support initializing spark-sql shell with a single-part schema, which also must be forced to the session catalog. case 1, specifying catalog field for v1sessioncatalog ```sql bin/spark-sql --database spark_catalog.default Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spark_catalog.default' not found ``` case 2, setting the default catalog to another one ```sql bin/spark-sql -c spark.sql.defaultCatalog=testcat -c spark.sql.catalog.testcat=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog -c spark.sql.catalog.testcat.url='jdbc:derby:memory:testcat;create=true' -c spark.sql.catalog.testcat.driver=org.apache.derby.jdbc.AutoloadedDriver -c spark.sql.catalogImplementation=in-memory --database SYS 23/03/16 18:40:49 WARN ObjectStore: Failed to get database sys, returning NoSuchObjectException Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'sys' not found ``` In this PR, we switch to use-statement to support multipart namespaces, which helps us resovle to catalog correctly. ### Why are the changes needed? Make spark-sql shell better support the v2 catalog framework. ### Does this PR introduce _any_ user-facing change? Yes, `--database` option supports multipart namespaces and works for v2 catalogs now. And you will see this behavior on spark web ui. ### How was this patch tested? new ut Closes #40457 from yaooqinn/SPARK-42823. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun (cherry picked from commit 2000d5f8db838db62967a45d574728a8bf2aaf6b) Signed-off-by: Dongjoon Hyun --- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 15 ++--- .../spark/sql/hive/thriftserver/CliSuite.scala | 26 ++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index 51b314ad2c1..22df4e67440 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -201,14 +201,6 @@ private[hive] object SparkSQLCLIDriver extends Logging { case e: UnsupportedEncodingException => exit(ERROR_PATH_NOT_FOUND) } -if (sessionState.database != null) { - SparkSQLEnv.sqlContext.sessionState.catalog.setCurrentDatabase( -s"${sessionState.database}") -} - -// Execute -i init files (always in silent mode) -cli.processInitFiles(sessionState) - // We don't propagate hive.metastore.warehouse.dir, because it might has been adjusted in // [[SharedState.loadHiveConfFile]] based on the user specified or default values of // spark.sql.warehouse.dir and hive.metastore.warehouse.dir. 
@@ -216,6 +208,13 @@ private[hive] object SparkSQLCLIDriver extends Logging { SparkSQLEnv.sqlContext.setConf(k, v) } +if (sessionState.database != null) { + SparkSQLEnv.sqlContext.sql(s"USE ${sessionState.database}") +} + +// Execute -i init files (always in silent mode) +cli.processInitFiles(sessionState) + cli.printMasterAndAppId if (sessionState.execString != null) { diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala index 5413635ba47..651c6b7aafb 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala @@ -35,6 +35,7 @@ import org.apache.hadoop.hive.ql.session.SessionState import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkContext, SparkFunSuite} import org.apache.spark.ProcessTestUtils.ProcessOutputCapturer import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.execution.data
[spark] branch master updated (2000d5f8db8 -> c2380868ad9)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 2000d5f8db8 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization add c2380868ad9 [SPARK-42814][BUILD] Upgrade maven plugins to latest versions No new revisions were added by this update. Summary of changes: pom.xml | 12 ++-- sql/hive/pom.xml | 1 - 2 files changed, 6 insertions(+), 7 deletions(-)
[spark] branch branch-3.4 updated (c29cf34bfc6 -> 5bd7b09bbc5)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git from c29cf34bfc6 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization add 5bd7b09bbc5 [SPARK-42778][SQL][3.4] QueryStageExec should respect supportsRowBased No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala | 1 + 1 file changed, 1 insertion(+)