[spark] branch master updated: [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3c7ef7d6135 [SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`
3c7ef7d6135 is described below

commit 3c7ef7d6135a33448e9b08902f4b5582ae2d60c4
Author: Ruifeng Zheng
AuthorDate: Thu Mar 16 15:03:56 2023 +0800

[SPARK-42800][CONNECT][PYTHON][ML] Implement ml function `{array_to_vector, vector_to_array}`

### What changes were proposed in this pull request?
Implement ml function `{array_to_vector, vector_to_array}`

### Why are the changes needed?
function parity

### Does this PR introduce _any_ user-facing change?
yes, new functions

### How was this patch tested?
added ut and manually check

```
(spark_dev) ➜  spark git:(connect_ml_functions) ✗ bin/pyspark --remote "local[*]"
Python 3.9.16 (main, Mar  8 2023, 04:29:24)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.11.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/15 11:56:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/

Using Python version 3.9.16 (main, Mar  8 2023 04:29:24)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.

In [1]:

In [1]: query = """
   ...: SELECT * FROM VALUES
   ...: (1, 4, ARRAY(1.0, 2.0, 3.0)),
   ...: (1, 2, ARRAY(-1.0, -2.0, -3.0))
   ...: AS tab(a, b, c)
   ...: """

In [2]: cdf = spark.sql(query)

In [3]: from pyspark.sql.connect.ml import functions as CF

In [4]: cdf1 = cdf.select("a", CF.array_to_vector(cdf.c).alias("d"))

In [5]: cdf1.show()
+---+----------------+
|  a|               d|
+---+----------------+
|  1|   [1.0,2.0,3.0]|
|  1|[-1.0,-2.0,-3.0]|
+---+----------------+

In [6]: cdf1.schema
Out[6]: StructType([StructField('a', IntegerType(), False), StructField('d', VectorUDT(), True)])

In [7]: cdf1.select(CF.vector_to_array(cdf1.d))
Out[7]: DataFrame[UDF(d): array<double>]

In [8]: cdf1.select(CF.vector_to_array(cdf1.d)).show()
+------------------+
|            UDF(d)|
+------------------+
|   [1.0, 2.0, 3.0]|
|[-1.0, -2.0, -3.0]|
+------------------+

In [9]: cdf1.select(CF.vector_to_array(cdf1.d)).schema
Out[9]: StructType([StructField('UDF(d)', ArrayType(DoubleType(), False), False)])
```

Closes #40432 from zhengruifeng/connect_ml_functions.
Authored-by: Ruifeng Zheng
Signed-off-by: Ruifeng Zheng
---
 .../sql/connect/planner/SparkConnectPlanner.scala  |  42
 dev/sparktestsupport/modules.py                    |  92 +
 .../main/scala/org/apache/spark/ml/functions.scala |   6 +-
 python/pyspark/ml/connect/__init__.py              |  18
 python/pyspark/ml/connect/functions.py             |  76 ++
 python/pyspark/ml/functions.py                     |   9 ++
 .../ml/tests/connect/test_connect_function.py      | 113 +
 python/pyspark/ml/util.py                          |  36 ++-
 8 files changed, 346 insertions(+), 46 deletions(-)

diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index a057bd8d6c1..20db252c057 100644
--- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -30,6 +30,7 @@ import org.apache.spark.connect.proto
 import org.apache.spark.connect.proto.{ExecutePlanResponse, SqlCommand}
 import org.apache.spark.connect.proto.ExecutePlanResponse.SqlCommandResult
 import org.apache.spark.connect.proto.Parse.ParseFormat
+import org.apache.spark.ml.{functions => MLFunctions}
 import org.apache.spark.sql.{Column, Dataset, Encoders, SparkSession}
 import org.apache.spark.sql.catalyst.{expressions, AliasIdentifier, FunctionIdentifier}
 import org.apache.spark.sql.catalyst.analysis.{GlobalTempView, LocalTempView, MultiAlias, ParameterizedQuery
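For reference, the new functions can also be exercised outside the interactive shell. The sketch below is a hedged, minimal example built from the transcript above; it assumes a Spark Connect server reachable at `sc://localhost`, and the import path follows the transcript (the diffstat also adds `python/pyspark/ml/connect/functions.py`, so the final module location may differ by build).

```python
# Minimal sketch (assumption: a Spark Connect server is running at sc://localhost).
from pyspark.sql import SparkSession
from pyspark.sql.connect.ml import functions as CF  # import path as shown in the transcript above

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = spark.sql("""
    SELECT * FROM VALUES
      (1, 4, ARRAY(1.0, 2.0, 3.0)),
      (1, 2, ARRAY(-1.0, -2.0, -3.0))
    AS tab(a, b, c)
""")

# array<double> -> VectorUDT, then back to array<double>
vec_df = df.select("a", CF.array_to_vector(df.c).alias("d"))
vec_df.select(CF.vector_to_array(vec_df.d)).show()
```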
[spark] branch master updated: [SPARK-41233][SQL][PYTHON] Add `array_prepend` function
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3dd629629ab [SPARK-41233][SQL][PYTHON] Add `array_prepend` function 3dd629629ab is described below commit 3dd629629ab151688b82a3aa66e1b5fa568afbfa Author: Navin Viswanath AuthorDate: Thu Mar 16 17:51:33 2023 +0800 [SPARK-41233][SQL][PYTHON] Add `array_prepend` function ### What changes were proposed in this pull request? Adds a new array function array_prepend to catalyst. ### Why are the changes needed? This adds a function that exists in many SQL implementations, specifically Snowflake: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added unit tests. Closes #38947 from navinvishy/array-prepend. Lead-authored-by: Navin Viswanath Co-authored-by: navinvishy Signed-off-by: Ruifeng Zheng --- .../source/reference/pyspark.sql/functions.rst | 1 + python/pyspark/sql/functions.py| 30 + .../sql/catalyst/analysis/FunctionRegistry.scala | 1 + .../expressions/collectionOperations.scala | 146 + .../expressions/CollectionExpressionsSuite.scala | 44 +++ .../scala/org/apache/spark/sql/functions.scala | 10 ++ .../sql-functions/sql-expression-schema.md | 3 +- .../src/test/resources/sql-tests/inputs/array.sql | 11 ++ .../resources/sql-tests/results/ansi/array.sql.out | 72 ++ .../test/resources/sql-tests/results/array.sql.out | 72 ++ .../apache/spark/sql/DataFrameFunctionsSuite.scala | 68 ++ 11 files changed, 457 insertions(+), 1 deletion(-) diff --git a/python/docs/source/reference/pyspark.sql/functions.rst b/python/docs/source/reference/pyspark.sql/functions.rst index 70fc04ef9cf..cbc46e1fae1 100644 --- a/python/docs/source/reference/pyspark.sql/functions.rst +++ b/python/docs/source/reference/pyspark.sql/functions.rst @@ -159,6 +159,7 @@ Collection Functions array_sort array_insert array_remove +array_prepend array_distinct array_intersect array_union diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 051fd52a13c..1f02be3ad21 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -7631,6 +7631,36 @@ def get(col: "ColumnOrName", index: Union["ColumnOrName", int]) -> Column: return _invoke_function_over_columns("get", col, index) +@try_remote_functions +def array_prepend(col: "ColumnOrName", value: Any) -> Column: +""" +Collection function: Returns an array containing element as +well as all elements from array. The new element is positioned +at the beginning of the array. + +.. versionadded:: 3.5.0 + +Parameters +-- +col : :class:`~pyspark.sql.Column` or str +name of column containing array +value : +a literal value, or a :class:`~pyspark.sql.Column` expression. + +Returns +--- +:class:`~pyspark.sql.Column` +an array excluding given value. 
+ +Examples + +>>> df = spark.createDataFrame([([2, 3, 4],), ([],)], ['data']) +>>> df.select(array_prepend(df.data, 1)).collect() +[Row(array_prepend(data, 1)=[1, 2, 3, 4]), Row(array_prepend(data, 1)=[1])] +""" +return _invoke_function_over_columns("array_prepend", col, lit(value)) + + @try_remote_functions def array_remove(col: "ColumnOrName", element: Any) -> Column: """ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index ad82a836199..aca73741c63 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -697,6 +697,7 @@ object FunctionRegistry { expression[Sequence]("sequence"), expression[ArrayRepeat]("array_repeat"), expression[ArrayRemove]("array_remove"), +expression[ArrayPrepend]("array_prepend"), expression[ArrayDistinct]("array_distinct"), expression[ArrayTransform]("transform"), expression[MapFilter]("map_filter"), diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index 289859d420b..2ccb3a6d0cd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala
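For reference, a minimal sketch of the proposed API, lifted from the docstring example in the diff above. Note that a later commit in this digest reverts SPARK-41233, so the function is only available on builds that actually contain it.

```python
# Sketch based on the docstring example above; requires a build containing SPARK-41233.
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_prepend

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([2, 3, 4],), ([],)], ["data"])

# The new element is positioned at the beginning of each array.
df.select(array_prepend(df.data, 1)).collect()
# [Row(array_prepend(data, 1)=[1, 2, 3, 4]), Row(array_prepend(data, 1)=[1])]
```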
[spark] branch master updated: [SPARK-42819][SS] Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bafb953f32e [SPARK-42819][SS] Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming bafb953f32e is described below commit bafb953f32e4f7c04f7f163354ec997bae8aa0e6 Author: Anish Shrigondekar AuthorDate: Thu Mar 16 19:52:16 2023 +0900 [SPARK-42819][SS] Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming ### What changes were proposed in this pull request? Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming ### Why are the changes needed? We need these settings in order to control memory tuning for RocksDB. We already expose settings for blockCache size. However, these 2 settings are missing. This change proposes to add them. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests and docs in the guide doc RocksDBSuite ``` [info] Run completed in 59 seconds, 336 milliseconds. [info] Total number of tests run: 27 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 165 s (02:45), completed Mar 15, 2023, 11:24:17 PM ``` RocksDBStateStoreSuite ``` [info] Run completed in 1 minute, 16 seconds. [info] Total number of tests run: 73 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 73, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` Closes #40455 from anishshri-db/task/SPARK-42819. Authored-by: Anish Shrigondekar Signed-off-by: Jungtaek Lim --- docs/structured-streaming-programming-guide.md | 10 ++ .../sql/execution/streaming/state/RocksDB.scala| 41 -- .../streaming/state/RocksDBStateStoreSuite.scala | 4 +++ .../execution/streaming/state/RocksDBSuite.scala | 27 ++ 4 files changed, 79 insertions(+), 3 deletions(-) diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index a71c774f328..486bed7184f 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -2289,6 +2289,16 @@ Here are the configs regarding to RocksDB instance of the state store provider: Whether we track the total number of rows in state store. Please refer the details in Performance-aspect considerations. True + +spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB +The maximum size of MemTable in RocksDB. Value of -1 means that RocksDB internal default values will be used +-1 + + +spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber +The maximum number of MemTables in RocksDB, both active and immutable. 
Value of -1 means that RocksDB internal default values will be used +-1 + # Performance-aspect considerations diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala index 89872afb80e..363cc2b5c46 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala @@ -72,7 +72,21 @@ class RocksDB( tableFormatConfig.setFilterPolicy(bloomFilter) tableFormatConfig.setFormatVersion(conf.formatVersion) - private val dbOptions = new Options() // options to open the RocksDB + private val columnFamilyOptions = new ColumnFamilyOptions() + + private val dbOptions = +new Options(new DBOptions(), columnFamilyOptions) // options to open the RocksDB + + // Set RocksDB options around MemTable memory usage. By default, we let RocksDB + // use its internal default values for these settings. + if (conf.writeBufferSizeMB > 0L) { +columnFamilyOptions.setWriteBufferSize(conf.writeBufferSizeMB * 1024 * 1024) + } + + if (conf.maxWriteBufferNumber > 0L) { +columnFamilyOptions.setMaxWriteBufferNumber(conf.maxWriteBufferNumber) + } + dbOptions.setCreateIfMissing(true) dbOptions.setTableFormatConfig(tableFormatConfig) dbOptions.setMaxOpenFiles(conf.maxOpenFiles) @@ -558,7 +572,9 @@ case class RocksDBConf( resetStatsOnLoad : Boolean, formatVersion: Int, trackTotalNumberOfRows: Boolean, -maxOpenFiles: Int) +maxOpenFiles: Int, +writeBufferSizeMB: Long, +maxWriteBufferNumber: Int) object RocksDBConf { /** Common prefix of all confs in SQLConf that affects RocksDB */ @@ -609,6 +625,15 @@ object Ro
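For reference, a minimal sketch of how the two new settings could be applied to a stateful streaming query. The RocksDB provider class name and the rate source are assumptions used purely for illustration; leaving either setting at -1 (the default) keeps RocksDB's internal values.

```python
# Hedged sketch: RocksDB-backed state store with the new MemTable settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    .config("spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")   # max MemTable size
    .config("spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber", "4") # active + immutable MemTables
    .getOrCreate()
)

# A stateful aggregation so the state store is actually exercised.
counts = spark.readStream.format("rate").load().groupBy("value").count()
query = counts.writeStream.format("console").outputMode("complete").start()
```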
[spark] branch master updated: Revert "[SPARK-41233][SQL][PYTHON] Add `array_prepend` function"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new baf90206d04 Revert "[SPARK-41233][SQL][PYTHON] Add `array_prepend` function" baf90206d04 is described below commit baf90206d04738e63ea71f63d86668a7dc7c8f9a Author: Hyukjin Kwon AuthorDate: Thu Mar 16 20:35:47 2023 +0900 Revert "[SPARK-41233][SQL][PYTHON] Add `array_prepend` function" This reverts commit 3dd629629ab151688b82a3aa66e1b5fa568afbfa. --- .../source/reference/pyspark.sql/functions.rst | 1 - python/pyspark/sql/functions.py| 30 - .../sql/catalyst/analysis/FunctionRegistry.scala | 1 - .../expressions/collectionOperations.scala | 146 - .../expressions/CollectionExpressionsSuite.scala | 44 --- .../scala/org/apache/spark/sql/functions.scala | 10 -- .../sql-functions/sql-expression-schema.md | 3 +- .../src/test/resources/sql-tests/inputs/array.sql | 11 -- .../resources/sql-tests/results/ansi/array.sql.out | 72 -- .../test/resources/sql-tests/results/array.sql.out | 72 -- .../apache/spark/sql/DataFrameFunctionsSuite.scala | 68 -- 11 files changed, 1 insertion(+), 457 deletions(-) diff --git a/python/docs/source/reference/pyspark.sql/functions.rst b/python/docs/source/reference/pyspark.sql/functions.rst index cbc46e1fae1..70fc04ef9cf 100644 --- a/python/docs/source/reference/pyspark.sql/functions.rst +++ b/python/docs/source/reference/pyspark.sql/functions.rst @@ -159,7 +159,6 @@ Collection Functions array_sort array_insert array_remove -array_prepend array_distinct array_intersect array_union diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 1f02be3ad21..051fd52a13c 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -7631,36 +7631,6 @@ def get(col: "ColumnOrName", index: Union["ColumnOrName", int]) -> Column: return _invoke_function_over_columns("get", col, index) -@try_remote_functions -def array_prepend(col: "ColumnOrName", value: Any) -> Column: -""" -Collection function: Returns an array containing element as -well as all elements from array. The new element is positioned -at the beginning of the array. - -.. versionadded:: 3.5.0 - -Parameters --- -col : :class:`~pyspark.sql.Column` or str -name of column containing array -value : -a literal value, or a :class:`~pyspark.sql.Column` expression. - -Returns ---- -:class:`~pyspark.sql.Column` -an array excluding given value. 
- -Examples - ->>> df = spark.createDataFrame([([2, 3, 4],), ([],)], ['data']) ->>> df.select(array_prepend(df.data, 1)).collect() -[Row(array_prepend(data, 1)=[1, 2, 3, 4]), Row(array_prepend(data, 1)=[1])] -""" -return _invoke_function_over_columns("array_prepend", col, lit(value)) - - @try_remote_functions def array_remove(col: "ColumnOrName", element: Any) -> Column: """ diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index aca73741c63..ad82a836199 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -697,7 +697,6 @@ object FunctionRegistry { expression[Sequence]("sequence"), expression[ArrayRepeat]("array_repeat"), expression[ArrayRemove]("array_remove"), -expression[ArrayPrepend]("array_prepend"), expression[ArrayDistinct]("array_distinct"), expression[ArrayTransform]("transform"), expression[MapFilter]("map_filter"), diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index 2ccb3a6d0cd..289859d420b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -1399,152 +1399,6 @@ case class ArrayContains(left: Expression, right: Expression) copy(left = newLeft, right = newRight) } -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = """ - _FUNC_(array, element) - Add the element at the beginning of the array passed as first - argument. Type of element should be the same as the type of the elements of the array. - Null element is also prepended to the array. But if the array passed is NULL - output is NULL -""
[spark] branch master updated: [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f025d5eb1c2 [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster f025d5eb1c2 is described below commit f025d5eb1c2c9a6f7933679aa80752e806df9d2a Author: Chandni Singh AuthorDate: Thu Mar 16 14:27:31 2023 -0700 [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster ### What changes were proposed in this pull request? Removed the logging of shuffle service name multiple times in the driver log. It gets logged everytime a new executor is allocated. ### Why are the changes needed? This is needed because currently the driver logs gets polluted by these logs: ``` 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' ``` ### Does this PR introduce _any_ user-facing change? Yes, the shuffle service name will be just logged once in the driver. ### How was this patch tested? Tested manually since it just changes the logging. With this see this logged in the driver logs: `23/03/15 16:50:54 INFO ApplicationMaster: Initializing service data for shuffle service using name 'spark_shuffle_311'` Closes #40448 from otterc/SPARK-42817. 
Authored-by: Chandni Singh Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala | 5 - .../main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala | 1 - 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala index 252c84a1cd4..8bf31a9286e 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala @@ -498,7 +498,10 @@ private[spark] class ApplicationMaster( // that when the driver sends an initial executor request (e.g. after an AM restart), // the allocator is ready to service requests. rpcEnv.setupEndpoint("YarnAM", new AMEndpoint(rpcEnv, driverRef)) - +if (_sparkConf.get(SHUFFLE_SERVICE_ENABLED)) { + logInfo("Initializing service data for shuffle service using name '" + +s"${_sparkConf.get(SHUFFLE_SERVICE_NAME)}'") +} allocator.allocateResources() val ms = MetricsSystem.createMetricsSystem(MetricsSystemInstances.APPLICATION_MASTER, sparkConf) val prefix = _sparkConf.get(YARN_METRICS_NAMESPACE).getOrElse(appId) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala index 0148b6f3c95..1f3121ed224 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala @@ -115,7 +115,6 @@ private[yarn] class ExecutorRunnable( ByteBuffer.allocate(0) } val serviceName = sparkConf.get(SHUFFLE_SERVICE_NAME) - logInfo(s"Initializing service
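For context, the log line in question is emitted only when the external shuffle service is enabled; on YARN the service name itself is configurable. A hedged sketch of the relevant settings follows; the config keys are standard Spark settings, and the name value is simply the one appearing in the logs above.

```python
# Sketch of the configuration that triggers the (now once-only) AM log line.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.service.name", "spark_shuffle_311")  # name echoed in the driver log
    .getOrCreate()
)
```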
[spark] branch branch-3.4 updated: [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 62d6a3b998a [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster 62d6a3b998a is described below commit 62d6a3b998a1ad341e4d0870e0cc00d46be96c6e Author: Chandni Singh AuthorDate: Thu Mar 16 14:27:31 2023 -0700 [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster ### What changes were proposed in this pull request? Removed the logging of shuffle service name multiple times in the driver log. It gets logged everytime a new executor is allocated. ### Why are the changes needed? This is needed because currently the driver logs gets polluted by these logs: ``` 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' ``` ### Does this PR introduce _any_ user-facing change? Yes, the shuffle service name will be just logged once in the driver. ### How was this patch tested? Tested manually since it just changes the logging. With this see this logged in the driver logs: `23/03/15 16:50:54 INFO ApplicationMaster: Initializing service data for shuffle service using name 'spark_shuffle_311'` Closes #40448 from otterc/SPARK-42817. 
Authored-by: Chandni Singh Signed-off-by: Dongjoon Hyun (cherry picked from commit f025d5eb1c2c9a6f7933679aa80752e806df9d2a) Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala | 5 - .../main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala | 1 - 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala index 9815fa6df8a..73deaf7a028 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala @@ -498,7 +498,10 @@ private[spark] class ApplicationMaster( // that when the driver sends an initial executor request (e.g. after an AM restart), // the allocator is ready to service requests. rpcEnv.setupEndpoint("YarnAM", new AMEndpoint(rpcEnv, driverRef)) - +if (_sparkConf.get(SHUFFLE_SERVICE_ENABLED)) { + logInfo("Initializing service data for shuffle service using name '" + +s"${_sparkConf.get(SHUFFLE_SERVICE_NAME)}'") +} allocator.allocateResources() val ms = MetricsSystem.createMetricsSystem(MetricsSystemInstances.APPLICATION_MASTER, sparkConf) val prefix = _sparkConf.get(YARN_METRICS_NAMESPACE).getOrElse(appId) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala index 0148b6f3c95..1f3121ed224 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala @@ -115,7 +115,6 @@ private[yarn] class ExecutorRunnable( ByteBuffer.allocat
[spark] branch master updated: [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1b40565f99b [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version 1b40565f99b is described below commit 1b40565f99b1a3888be46cfdf673b60198bf47a2 Author: itholic AuthorDate: Fri Mar 17 09:43:10 2023 +0900 [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version ### What changes were proposed in this pull request? This PR proposes to add a migration note for update to supported pandas version. ### Why are the changes needed? Some APIs have been deprecated or removed from SPARK-42593 to follow pandas 2.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review is required. Closes #40459 from itholic/SPARK-42826. Lead-authored-by: itholic Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon --- python/docs/source/migration_guide/pyspark_upgrade.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst index 0a590548684..d06475f9b36 100644 --- a/python/docs/source/migration_guide/pyspark_upgrade.rst +++ b/python/docs/source/migration_guide/pyspark_upgrade.rst @@ -33,6 +33,7 @@ Upgrading from PySpark 3.3 to 3.4 * In Spark 3.4, the ``Series.concat`` sort parameter will be respected to follow pandas 1.4 behaviors. * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace pre-existing arrays, which will NOT be over-written to follow pandas 1.4 behaviors. * In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` have got new parameter ``args`` which provides binding of named parameters to their SQL literals. +* In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs were deprecated or removed in Spark 3.4 according to the changes made in pandas 2.0. Please refer to the [release notes of pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details. Upgrading from PySpark 3.2 to 3.3 @@ -108,4 +109,4 @@ Upgrading from PySpark 1.4 to 1.5 Upgrading from PySpark 1.0-1.2 to 1.3 - -* When using DataTypes in Python you will need to construct them (i.e. ``StringType()``) instead of referencing a singleton. \ No newline at end of file +* When using DataTypes in Python you will need to construct them (i.e. ``StringType()``) instead of referencing a singleton. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.4 updated: [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 833599c6e27 [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version 833599c6e27 is described below commit 833599c6e27c98556d596c7d938c432e039e85ba Author: itholic AuthorDate: Fri Mar 17 09:43:10 2023 +0900 [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version ### What changes were proposed in this pull request? This PR proposes to add a migration note for update to supported pandas version. ### Why are the changes needed? Some APIs have been deprecated or removed from SPARK-42593 to follow pandas 2.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review is required. Closes #40459 from itholic/SPARK-42826. Lead-authored-by: itholic Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon (cherry picked from commit 1b40565f99b1a3888be46cfdf673b60198bf47a2) Signed-off-by: Hyukjin Kwon --- python/docs/source/migration_guide/pyspark_upgrade.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst index 0a590548684..d06475f9b36 100644 --- a/python/docs/source/migration_guide/pyspark_upgrade.rst +++ b/python/docs/source/migration_guide/pyspark_upgrade.rst @@ -33,6 +33,7 @@ Upgrading from PySpark 3.3 to 3.4 * In Spark 3.4, the ``Series.concat`` sort parameter will be respected to follow pandas 1.4 behaviors. * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace pre-existing arrays, which will NOT be over-written to follow pandas 1.4 behaviors. * In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` have got new parameter ``args`` which provides binding of named parameters to their SQL literals. +* In Spark 3.4, Pandas API on Spark follows for the pandas 2.0, and some APIs were deprecated or removed in Spark 3.4 according to the changes made in pandas 2.0. Please refer to the [release notes of pandas](https://pandas.pydata.org/docs/dev/whatsnew/) for more details. Upgrading from PySpark 3.2 to 3.3 @@ -108,4 +109,4 @@ Upgrading from PySpark 1.4 to 1.5 Upgrading from PySpark 1.0-1.2 to 1.3 - -* When using DataTypes in Python you will need to construct them (i.e. ``StringType()``) instead of referencing a singleton. \ No newline at end of file +* When using DataTypes in Python you will need to construct them (i.e. ``StringType()``) instead of referencing a singleton. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new deac4813044 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes deac4813044 is described below commit deac481304489f9b8ecd24ec6f3aed1e0c0d75eb Author: itholic AuthorDate: Fri Mar 17 11:13:00 2023 +0900 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes ### What changes were proposed in this pull request? This pull request proposes an improvement to the error message when trying to access a JVM attribute that is not supported in Spark Connect. Specifically, it adds a more informative error message that clearly indicates which attribute is not supported due to Spark Connect's lack of dependency on the JVM. ### Why are the changes needed? Currently, when attempting to access an unsupported JVM attribute in Spark Connect, the error message is not very clear, making it difficult for users to understand the root cause of the issue. This improvement aims to provide more helpful information to users to address this problem as below: **Before** ```python >>> spark._jsc Traceback (most recent call last): File "", line 1, in AttributeError: 'SparkSession' object has no attribute '_jsc' ``` **After** ```python >>> spark._jsc Traceback (most recent call last): File "", line 1, in File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/session.py", line 490, in _jsc raise PySparkAttributeError( pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsc` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, use the original PySpark instead of Spark Connect. ``` ### Does this PR introduce _any_ user-facing change? This PR does not introduce any user-facing change in terms of functionality. However, it improves the error message, which could potentially affect the user experience in a positive way. ### How was this patch tested? This patch was tested by adding new unit tests that specifically target the error message related to unsupported JVM attributes. The tests were run locally on a development environment. Closes #40458 from itholic/SPARK-42824. 
Authored-by: itholic Signed-off-by: Hyukjin Kwon --- python/pyspark/errors/__init__.py | 2 + python/pyspark/errors/error_classes.py | 5 +++ python/pyspark/errors/exceptions/base.py | 6 +++ python/pyspark/sql/connect/column.py | 12 +- python/pyspark/sql/connect/dataframe.py| 6 ++- python/pyspark/sql/connect/readwriter.py | 7 python/pyspark/sql/connect/session.py | 26 .../sql/tests/connect/test_connect_basic.py| 49 +- 8 files changed, 110 insertions(+), 3 deletions(-) diff --git a/python/pyspark/errors/__init__.py b/python/pyspark/errors/__init__.py index 95da7ca2aa8..94117fc5160 100644 --- a/python/pyspark/errors/__init__.py +++ b/python/pyspark/errors/__init__.py @@ -31,6 +31,7 @@ from pyspark.errors.exceptions.base import ( # noqa: F401 SparkUpgradeException, PySparkTypeError, PySparkValueError, +PySparkAttributeError, ) @@ -47,4 +48,5 @@ __all__ = [ "SparkUpgradeException", "PySparkTypeError", "PySparkValueError", +"PySparkAttributeError", ] diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py index 8c0f79f7d5a..dda1f5a1f84 100644 --- a/python/pyspark/errors/error_classes.py +++ b/python/pyspark/errors/error_classes.py @@ -39,6 +39,11 @@ ERROR_CLASSES_JSON = """ "Function `` should return Column, got ." ] }, + "JVM_ATTRIBUTE_NOT_SUPPORTED" : { +"message" : [ + "Attribute `` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session." +] + }, "NOT_BOOL" : { "message" : [ "Argument `` should be a bool, got ." diff --git a/python/pyspark/errors/exceptions/base.py b/python/pyspark/errors/exceptions/base.py index 6e67039374d..fa66b80ac3a 100644 --- a/python/pyspark/errors/exceptions/base.py +++ b/python/pyspark/errors/exceptions/base.py @@ -160,3 +160,9 @@ class PySparkTypeError(PySparkException, TypeError): """ Wrapper class for TypeError to support error classes. """ + + +class PySparkAttributeError(PySparkException, AttributeError): +""" +Wrapper class for AttributeError to support error classes. +""" diff --git a/python/pyspark/sql/connect/column.py
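For reference, a minimal sketch of how client code that must run both with and without Spark Connect could react to the new error class. Since `PySparkAttributeError` subclasses `AttributeError` (per the diff), catching it is an additive change; the fallback branch is an assumption for illustration only.

```python
# Hedged sketch: degrade gracefully when a JVM-backed attribute is unavailable.
from pyspark.sql import SparkSession
from pyspark.errors import PySparkAttributeError

spark = SparkSession.builder.getOrCreate()

try:
    jsc = spark._jsc  # JVM-backed attribute; unsupported on Spark Connect
except PySparkAttributeError:
    jsc = None        # assumed fallback: take a JVM-free code path instead
```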
[spark] branch branch-3.4 updated: [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new ca75340d607 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes ca75340d607 is described below commit ca75340d607f13932c9a49081ae9effcee5ac3f7 Author: itholic AuthorDate: Fri Mar 17 11:13:00 2023 +0900 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes ### What changes were proposed in this pull request? This pull request proposes an improvement to the error message when trying to access a JVM attribute that is not supported in Spark Connect. Specifically, it adds a more informative error message that clearly indicates which attribute is not supported due to Spark Connect's lack of dependency on the JVM. ### Why are the changes needed? Currently, when attempting to access an unsupported JVM attribute in Spark Connect, the error message is not very clear, making it difficult for users to understand the root cause of the issue. This improvement aims to provide more helpful information to users to address this problem as below: **Before** ```python >>> spark._jsc Traceback (most recent call last): File "", line 1, in AttributeError: 'SparkSession' object has no attribute '_jsc' ``` **After** ```python >>> spark._jsc Traceback (most recent call last): File "", line 1, in File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/session.py", line 490, in _jsc raise PySparkAttributeError( pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsc` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, use the original PySpark instead of Spark Connect. ``` ### Does this PR introduce _any_ user-facing change? This PR does not introduce any user-facing change in terms of functionality. However, it improves the error message, which could potentially affect the user experience in a positive way. ### How was this patch tested? This patch was tested by adding new unit tests that specifically target the error message related to unsupported JVM attributes. The tests were run locally on a development environment. Closes #40458 from itholic/SPARK-42824. 
Authored-by: itholic Signed-off-by: Hyukjin Kwon (cherry picked from commit deac481304489f9b8ecd24ec6f3aed1e0c0d75eb) Signed-off-by: Hyukjin Kwon --- python/pyspark/errors/__init__.py | 2 + python/pyspark/errors/error_classes.py | 5 +++ python/pyspark/errors/exceptions/base.py | 6 +++ python/pyspark/sql/connect/column.py | 12 +- python/pyspark/sql/connect/dataframe.py| 6 ++- python/pyspark/sql/connect/readwriter.py | 7 python/pyspark/sql/connect/session.py | 26 .../sql/tests/connect/test_connect_basic.py| 49 +- 8 files changed, 110 insertions(+), 3 deletions(-) diff --git a/python/pyspark/errors/__init__.py b/python/pyspark/errors/__init__.py index 95da7ca2aa8..94117fc5160 100644 --- a/python/pyspark/errors/__init__.py +++ b/python/pyspark/errors/__init__.py @@ -31,6 +31,7 @@ from pyspark.errors.exceptions.base import ( # noqa: F401 SparkUpgradeException, PySparkTypeError, PySparkValueError, +PySparkAttributeError, ) @@ -47,4 +48,5 @@ __all__ = [ "SparkUpgradeException", "PySparkTypeError", "PySparkValueError", +"PySparkAttributeError", ] diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py index 8c0f79f7d5a..dda1f5a1f84 100644 --- a/python/pyspark/errors/error_classes.py +++ b/python/pyspark/errors/error_classes.py @@ -39,6 +39,11 @@ ERROR_CLASSES_JSON = """ "Function `` should return Column, got ." ] }, + "JVM_ATTRIBUTE_NOT_SUPPORTED" : { +"message" : [ + "Attribute `` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session." +] + }, "NOT_BOOL" : { "message" : [ "Argument `` should be a bool, got ." diff --git a/python/pyspark/errors/exceptions/base.py b/python/pyspark/errors/exceptions/base.py index 6e67039374d..fa66b80ac3a 100644 --- a/python/pyspark/errors/exceptions/base.py +++ b/python/pyspark/errors/exceptions/base.py @@ -160,3 +160,9 @@ class PySparkTypeError(PySparkException, TypeError): """ Wrapper class for TypeError to support error classes. """ + + +class PySparkAttributeError(PySparkException, AttributeError): +""" +Wrapp
[spark] branch master updated: [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2000d5f8db8 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization 2000d5f8db8 is described below commit 2000d5f8db838db62967a45d574728a8bf2aaf6b Author: Kent Yao AuthorDate: Thu Mar 16 20:29:16 2023 -0700 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization ### What changes were proposed in this pull request? Currently, we only support initializing spark-sql shell with a single-part schema, which also must be forced to the session catalog. case 1, specifying catalog field for v1sessioncatalog ```sql bin/spark-sql --database spark_catalog.default Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spark_catalog.default' not found ``` case 2, setting the default catalog to another one ```sql bin/spark-sql -c spark.sql.defaultCatalog=testcat -c spark.sql.catalog.testcat=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog -c spark.sql.catalog.testcat.url='jdbc:derby:memory:testcat;create=true' -c spark.sql.catalog.testcat.driver=org.apache.derby.jdbc.AutoloadedDriver -c spark.sql.catalogImplementation=in-memory --database SYS 23/03/16 18:40:49 WARN ObjectStore: Failed to get database sys, returning NoSuchObjectException Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'sys' not found ``` In this PR, we switch to use-statement to support multipart namespaces, which helps us resovle to catalog correctly. ### Why are the changes needed? Make spark-sql shell better support the v2 catalog framework. ### Does this PR introduce _any_ user-facing change? Yes, `--database` option supports multipart namespaces and works for v2 catalogs now. And you will see this behavior on spark web ui. ### How was this patch tested? new ut Closes #40457 from yaooqinn/SPARK-42823. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 15 ++--- .../spark/sql/hive/thriftserver/CliSuite.scala | 26 ++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index 51b314ad2c1..22df4e67440 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -201,14 +201,6 @@ private[hive] object SparkSQLCLIDriver extends Logging { case e: UnsupportedEncodingException => exit(ERROR_PATH_NOT_FOUND) } -if (sessionState.database != null) { - SparkSQLEnv.sqlContext.sessionState.catalog.setCurrentDatabase( -s"${sessionState.database}") -} - -// Execute -i init files (always in silent mode) -cli.processInitFiles(sessionState) - // We don't propagate hive.metastore.warehouse.dir, because it might has been adjusted in // [[SharedState.loadHiveConfFile]] based on the user specified or default values of // spark.sql.warehouse.dir and hive.metastore.warehouse.dir. 
@@ -216,6 +208,13 @@ private[hive] object SparkSQLCLIDriver extends Logging { SparkSQLEnv.sqlContext.setConf(k, v) } +if (sessionState.database != null) { + SparkSQLEnv.sqlContext.sql(s"USE ${sessionState.database}") +} + +// Execute -i init files (always in silent mode) +cli.processInitFiles(sessionState) + cli.printMasterAndAppId if (sessionState.execString != null) { diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala index 5413635ba47..651c6b7aafb 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala @@ -35,6 +35,7 @@ import org.apache.hadoop.hive.ql.session.SessionState import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkContext, SparkFunSuite} import org.apache.spark.ProcessTestUtils.ProcessOutputCapturer import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog import org.apache.spark.sql.hive.HiveUtils import org.apache.spark.sql.hive.Hive
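For reference, a roughly equivalent session-level sketch of what the shell now does: routing `--database` through a `USE` statement so multipart namespaces resolve via the v2 catalog framework. The catalog settings are reused from the example above and assume the Derby driver is on the classpath.

```python
# Hedged sketch of multipart namespace resolution with a v2 JDBC catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.testcat",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.testcat.url", "jdbc:derby:memory:testcat;create=true")
    .config("spark.sql.catalog.testcat.driver", "org.apache.derby.jdbc.AutoloadedDriver")
    .getOrCreate()
)

spark.sql("USE testcat.SYS")  # multipart namespace, resolved in the v2 catalog
spark.sql("SELECT current_catalog(), current_database()").show()
```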
[spark] branch branch-3.4 updated: [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new c29cf34bfc6 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization c29cf34bfc6 is described below commit c29cf34bfc694cd3d959c82a25adf251975f0817 Author: Kent Yao AuthorDate: Thu Mar 16 20:29:16 2023 -0700 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization ### What changes were proposed in this pull request? Currently, we only support initializing spark-sql shell with a single-part schema, which also must be forced to the session catalog. case 1, specifying catalog field for v1sessioncatalog ```sql bin/spark-sql --database spark_catalog.default Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spark_catalog.default' not found ``` case 2, setting the default catalog to another one ```sql bin/spark-sql -c spark.sql.defaultCatalog=testcat -c spark.sql.catalog.testcat=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog -c spark.sql.catalog.testcat.url='jdbc:derby:memory:testcat;create=true' -c spark.sql.catalog.testcat.driver=org.apache.derby.jdbc.AutoloadedDriver -c spark.sql.catalogImplementation=in-memory --database SYS 23/03/16 18:40:49 WARN ObjectStore: Failed to get database sys, returning NoSuchObjectException Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'sys' not found ``` In this PR, we switch to use-statement to support multipart namespaces, which helps us resovle to catalog correctly. ### Why are the changes needed? Make spark-sql shell better support the v2 catalog framework. ### Does this PR introduce _any_ user-facing change? Yes, `--database` option supports multipart namespaces and works for v2 catalogs now. And you will see this behavior on spark web ui. ### How was this patch tested? new ut Closes #40457 from yaooqinn/SPARK-42823. Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun (cherry picked from commit 2000d5f8db838db62967a45d574728a8bf2aaf6b) Signed-off-by: Dongjoon Hyun --- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 15 ++--- .../spark/sql/hive/thriftserver/CliSuite.scala | 26 ++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index 51b314ad2c1..22df4e67440 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -201,14 +201,6 @@ private[hive] object SparkSQLCLIDriver extends Logging { case e: UnsupportedEncodingException => exit(ERROR_PATH_NOT_FOUND) } -if (sessionState.database != null) { - SparkSQLEnv.sqlContext.sessionState.catalog.setCurrentDatabase( -s"${sessionState.database}") -} - -// Execute -i init files (always in silent mode) -cli.processInitFiles(sessionState) - // We don't propagate hive.metastore.warehouse.dir, because it might has been adjusted in // [[SharedState.loadHiveConfFile]] based on the user specified or default values of // spark.sql.warehouse.dir and hive.metastore.warehouse.dir. 
@@ -216,6 +208,13 @@ private[hive] object SparkSQLCLIDriver extends Logging { SparkSQLEnv.sqlContext.setConf(k, v) } +if (sessionState.database != null) { + SparkSQLEnv.sqlContext.sql(s"USE ${sessionState.database}") +} + +// Execute -i init files (always in silent mode) +cli.processInitFiles(sessionState) + cli.printMasterAndAppId if (sessionState.execString != null) { diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala index 5413635ba47..651c6b7aafb 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala @@ -35,6 +35,7 @@ import org.apache.hadoop.hive.ql.session.SessionState import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkContext, SparkFunSuite} import org.apache.spark.ProcessTestUtils.ProcessOutputCapturer import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.execution.data
[spark] branch master updated (2000d5f8db8 -> c2380868ad9)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 2000d5f8db8 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization add c2380868ad9 [SPARK-42814][BUILD] Upgrade maven plugins to latest versions No new revisions were added by this update. Summary of changes: pom.xml | 12 ++-- sql/hive/pom.xml | 1 - 2 files changed, 6 insertions(+), 7 deletions(-)
[spark] branch branch-3.4 updated (c29cf34bfc6 -> 5bd7b09bbc5)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git from c29cf34bfc6 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization add 5bd7b09bbc5 [SPARK-42778][SQL][3.4] QueryStageExec should respect supportsRowBased No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala | 1 + 1 file changed, 1 insertion(+)