[spark] branch master updated: [SPARK-42712][PYTHON][DOC] Improve docstring of mapInPandas and mapInArrow

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bacab6a7576 [SPARK-42712][PYTHON][DOC] Improve docstring of 
mapInPandas and mapInArrow
bacab6a7576 is described below

commit bacab6a7576967ec0871d55ebfc0ef81673321b9
Author: Xinrong Meng 
AuthorDate: Wed Mar 8 19:37:41 2023 +0900

[SPARK-42712][PYTHON][DOC] Improve docstring of mapInPandas and mapInArrow

### What changes were proposed in this pull request?
Improve docstring of mapInPandas and mapInArrow

### Why are the changes needed?
For readability. We call out that they are not scalar: the input and output of 
the function might be of different sizes.
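
As a minimal illustration (not part of the patch; it assumes a local SparkSession named `spark`), the mapped function may return fewer rows than it receives:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

def keep_even(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each element is a pandas.DataFrame batch; the frame we yield may have
    # a different number of rows than the batch that came in.
    for pdf in batches:
        yield pdf[pdf.id % 2 == 0]

df.mapInPandas(keep_even, schema="id long").show()
```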

### Does this PR introduce _any_ user-facing change?
No. Doc change only.

### How was this patch tested?
Existing tests.

Closes #40330 from xinrong-meng/doc.

Lead-authored-by: Xinrong Meng 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/pandas/map_ops.py | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/pandas/map_ops.py 
b/python/pyspark/sql/pandas/map_ops.py
index 2184fdce52d..1370cad33b1 100644
--- a/python/pyspark/sql/pandas/map_ops.py
+++ b/python/pyspark/sql/pandas/map_ops.py
@@ -44,7 +44,8 @@ class PandasMapOpsMixin:
 together as an iterator of `pandas.DataFrame`\\s to the function and 
the
 returned iterator of `pandas.DataFrame`\\s are combined as a 
:class:`DataFrame`.
 Each `pandas.DataFrame` size can be controlled by
-`spark.sql.execution.arrow.maxRecordsPerBatch`.
+`spark.sql.execution.arrow.maxRecordsPerBatch`. The size of the 
function's input and
+output can be different.
 
 .. versionadded:: 3.0.0
 
@@ -108,7 +109,8 @@ class PandasMapOpsMixin:
 together as an iterator of `pyarrow.RecordBatch`\\s to the function 
and the
 returned iterator of `pyarrow.RecordBatch`\\s are combined as a 
:class:`DataFrame`.
 Each `pyarrow.RecordBatch` size can be controlled by
-`spark.sql.execution.arrow.maxRecordsPerBatch`.
+`spark.sql.execution.arrow.maxRecordsPerBatch`. The size of the 
function's input and
+output can be different.
 
 .. versionadded:: 3.3.0
 





[spark] branch branch-3.4 updated: [SPARK-42712][PYTHON][DOC] Improve docstring of mapInPandas and mapInArrow

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 9c10cd96a07 [SPARK-42712][PYTHON][DOC] Improve docstring of 
mapInPandas and mapInArrow
9c10cd96a07 is described below

commit 9c10cd96a0742b6b1e2d5f7e7485f7b0b30a
Author: Xinrong Meng 
AuthorDate: Wed Mar 8 19:37:41 2023 +0900

[SPARK-42712][PYTHON][DOC] Improve docstring of mapInPandas and mapInArrow

### What changes were proposed in this pull request?
Improve docstring of mapInPandas and mapInArrow

### Why are the changes needed?
For readability. We call out that they are not scalar: the input and output of 
the function might be of different sizes.

### Does this PR introduce _any_ user-facing change?
No. Doc change only.

### How was this patch tested?
Existing tests.

Closes #40330 from xinrong-meng/doc.

Lead-authored-by: Xinrong Meng 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit bacab6a7576967ec0871d55ebfc0ef81673321b9)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/pandas/map_ops.py | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/pandas/map_ops.py 
b/python/pyspark/sql/pandas/map_ops.py
index 2184fdce52d..1370cad33b1 100644
--- a/python/pyspark/sql/pandas/map_ops.py
+++ b/python/pyspark/sql/pandas/map_ops.py
@@ -44,7 +44,8 @@ class PandasMapOpsMixin:
 together as an iterator of `pandas.DataFrame`\\s to the function and 
the
 returned iterator of `pandas.DataFrame`\\s are combined as a 
:class:`DataFrame`.
 Each `pandas.DataFrame` size can be controlled by
-`spark.sql.execution.arrow.maxRecordsPerBatch`.
+`spark.sql.execution.arrow.maxRecordsPerBatch`. The size of the 
function's input and
+output can be different.
 
 .. versionadded:: 3.0.0
 
@@ -108,7 +109,8 @@ class PandasMapOpsMixin:
 together as an iterator of `pyarrow.RecordBatch`\\s to the function 
and the
 returned iterator of `pyarrow.RecordBatch`\\s are combined as a 
:class:`DataFrame`.
 Each `pyarrow.RecordBatch` size can be controlled by
-`spark.sql.execution.arrow.maxRecordsPerBatch`.
+`spark.sql.execution.arrow.maxRecordsPerBatch`. The size of the 
function's input and
+output can be different.
 
 .. versionadded:: 3.3.0
 





[spark] branch master updated: [SPARK-42266][PYTHON] Remove the parent directory in shell.py execution when IPython is used

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8e83ab7de6b [SPARK-42266][PYTHON] Remove the parent directory in 
shell.py execution when IPython is used
8e83ab7de6b is described below

commit 8e83ab7de6b362df37741ba2ec944d53de95c51c
Author: Hyukjin Kwon 
AuthorDate: Wed Mar 8 19:38:49 2023 +0900

[SPARK-42266][PYTHON] Remove the parent directory in shell.py execution 
when IPython is used

### What changes were proposed in this pull request?

This PR proposes to remove the parent directory in `shell.py` execution 
when IPython is used.

This is a general issue for the PySpark shell specifically with IPython: IPython 
temporarily adds the parent directory of the script to the Python path (`sys.path`), 
which results in searching for packages under the `pyspark` directory. For example, 
`import pandas` attempts to import `pyspark.pandas`.

So far, we haven't had such a case in PySpark's own import paths, but Spark Connect 
now hits it through its dependency check (which attempts to import pandas), which 
exposes the actual problem.

Running it with IPython can easily reproduce the error:

```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```
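
A hedged sketch of the mechanism (not the actual `shell.py` patch): when the directory containing PySpark's own subpackages ends up on `sys.path`, dropping that entry restores normal resolution for `import pandas`:

```python
import os
import sys

# Illustrative only: assume this file lives in .../python/pyspark/, the
# directory IPython would prepend to sys.path when running shell.py.
pyspark_dir = os.path.dirname(os.path.abspath(__file__))

# Removing that entry keeps "import pandas" from resolving to
# pyspark/pandas instead of the real pandas distribution.
sys.path[:] = [p for p in sys.path if os.path.abspath(p or ".") != pyspark_dir]
```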

### Why are the changes needed?

To make PySpark shell properly import other packages even when the names 
conflict with subpackages (e.g., `pyspark.pandas` vs `pandas`)

### Does this PR introduce _any_ user-facing change?

No to end users:
- This path is only inserted for `shell.py` execution, and thankfully we didn't 
have such a relative import case so far.
- It fixes the issue in the as-yet-unreleased Spark Connect.

### How was this patch tested?

Manually tested.

```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```

**Before:**

```
Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
/.../spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/.../spark/python/pyspark/shell.py", line 40, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/.../spark/python/pyspark/sql/session.py", line 437, in getOrCreate
    from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
  File "/.../spark/python/pyspark/sql/connect/session.py", line 19, in <module>
    check_dependencies(__name__, __file__)
  File "/.../spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
  File "/.../spark/python/pyspark/pandas/__init__.py", line 29, in <module>
    from pyspark.pandas.missing.general_functions import MissingPandasLikeGeneralFunctions
  File "/.../spark/python/pyspark/pandas/__init__.py", line 34, in <module>
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 37, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
...
```

**After:**

```
Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
23/03/08 13:30:51 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/

Using Python version 3.9.16 (main, Feb  1 2023 21:42:20)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.

In [1]:
```

Closes #40327 from HyukjinKwon/SPARK-42266.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/shell.py | 12 
 1 file changed, 12 insertions(+)

diff --git a/python/pyspark/shell.py b/python/pyspark/shell.py
index 8613d2d09ea..86f576a3029 100644
--- a

[spark] branch branch-3.4 updated: [SPARK-42266][PYTHON] Remove the parent directory in shell.py execution when IPython is used

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new fa0f0e6879d [SPARK-42266][PYTHON] Remove the parent directory in 
shell.py execution when IPython is used
fa0f0e6879d is described below

commit fa0f0e6879d2c98389e0eeecf7b8294f7a8ee3ed
Author: Hyukjin Kwon 
AuthorDate: Wed Mar 8 19:38:49 2023 +0900

[SPARK-42266][PYTHON] Remove the parent directory in shell.py execution 
when IPython is used

### What changes were proposed in this pull request?

This PR proposes to remove the parent directory in `shell.py` execution 
when IPython is used.

This is a general issue for the PySpark shell specifically with IPython: IPython 
temporarily adds the parent directory of the script to the Python path (`sys.path`), 
which results in searching for packages under the `pyspark` directory. For example, 
`import pandas` attempts to import `pyspark.pandas`.

So far, we haven't had such a case in PySpark's own import paths, but Spark Connect 
now hits it through its dependency check (which attempts to import pandas), which 
exposes the actual problem.

Running it with IPython can easily reproduce the error:

```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```

### Why are the changes needed?

To make PySpark shell properly import other packages even when the names 
conflict with subpackages (e.g., `pyspark.pandas` vs `pandas`)

### Does this PR introduce _any_ user-facing change?

No to end users:
- This path is only inserted for `shell.py` execution, and thankfully we didn't 
have such a relative import case so far.
- It fixes the issue in the as-yet-unreleased Spark Connect.

### How was this patch tested?

Manually tested.

```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```

**Before:**

```
Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
/.../spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/.../spark/python/pyspark/shell.py", line 40, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/.../spark/python/pyspark/sql/session.py", line 437, in getOrCreate
    from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
  File "/.../spark/python/pyspark/sql/connect/session.py", line 19, in <module>
    check_dependencies(__name__, __file__)
  File "/.../spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
  File "/.../spark/python/pyspark/pandas/__init__.py", line 29, in <module>
    from pyspark.pandas.missing.general_functions import MissingPandasLikeGeneralFunctions
  File "/.../spark/python/pyspark/pandas/__init__.py", line 34, in <module>
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 37, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
...
```

**After:**

```
Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
23/03/08 13:30:51 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/

Using Python version 3.9.16 (main, Feb  1 2023 21:42:20)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.

In [1]:
```

Closes #40327 from HyukjinKwon/SPARK-42266.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 8e83ab7de6b362df37741ba2ec944d53de95c51c)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/shell.py | 12 
 1 file changed, 12 inserti

[spark] branch master updated (8e83ab7de6b -> e28f7f38e10)

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8e83ab7de6b [SPARK-42266][PYTHON] Remove the parent directory in 
shell.py execution when IPython is used
 add e28f7f38e10 [SPARK-42713][PYTHON][DOCS] Add '__getattr__' and 
'__getitem__' of DataFrame and Column to API reference

No new revisions were added by this update.

Summary of changes:
 .../docs/source/reference/pyspark.sql/column.rst   |  2 +
 .../source/reference/pyspark.sql/dataframe.rst |  2 +
 python/pyspark/sql/column.py   | 59 ++
 python/pyspark/sql/dataframe.py| 33 
 4 files changed, 96 insertions(+)





[spark] branch branch-3.4 updated: [SPARK-42713][PYTHON][DOCS] Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 1fcf4975063 [SPARK-42713][PYTHON][DOCS] Add '__getattr__' and 
'__getitem__' of DataFrame and Column to API reference
1fcf4975063 is described below

commit 1fcf4975063d4817b794243ef5a1854fe7de8cce
Author: Ruifeng Zheng 
AuthorDate: Wed Mar 8 20:13:57 2023 +0900

[SPARK-42713][PYTHON][DOCS] Add '__getattr__' and '__getitem__' of 
DataFrame and Column to API reference

### What changes were proposed in this pull request?
Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference

### Why are the changes needed?
 '__getattr__' and '__getitem__' are widely used, but we did not document 
them.
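
For reference, both access styles go through these dunder methods (a minimal example assuming an active SparkSession `spark`):

```python
df = spark.createDataFrame([(1, 25)], ["id", "age"])
df.select(df.age).show()     # attribute access -> DataFrame.__getattr__
df.select(df["age"]).show()  # indexing         -> DataFrame.__getitem__
```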

### Does this PR introduce _any_ user-facing change?
yes, new doc

### How was this patch tested?
added doctests

Closes #40331 from zhengruifeng/py_doc.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit e28f7f38e10cf081e4e04760d7f47045973f66ac)
Signed-off-by: Hyukjin Kwon 
---
 .../docs/source/reference/pyspark.sql/column.rst   |  2 +
 .../source/reference/pyspark.sql/dataframe.rst |  2 +
 python/pyspark/sql/column.py   | 59 ++
 python/pyspark/sql/dataframe.py| 33 
 4 files changed, 96 insertions(+)

diff --git a/python/docs/source/reference/pyspark.sql/column.rst 
b/python/docs/source/reference/pyspark.sql/column.rst
index b5f39d299c1..b897b5c00c4 100644
--- a/python/docs/source/reference/pyspark.sql/column.rst
+++ b/python/docs/source/reference/pyspark.sql/column.rst
@@ -24,6 +24,8 @@ Column
 .. autosummary::
 :toctree: api/
 
+Column.__getattr__
+Column.__getitem__
 Column.alias
 Column.asc
 Column.asc_nulls_first
diff --git a/python/docs/source/reference/pyspark.sql/dataframe.rst 
b/python/docs/source/reference/pyspark.sql/dataframe.rst
index e647704158f..aa306ccc382 100644
--- a/python/docs/source/reference/pyspark.sql/dataframe.rst
+++ b/python/docs/source/reference/pyspark.sql/dataframe.rst
@@ -25,6 +25,8 @@ DataFrame
 .. autosummary::
 :toctree: api/
 
+DataFrame.__getattr__
+DataFrame.__getitem__
 DataFrame.agg
 DataFrame.alias
 DataFrame.approxQuantile
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index abd28136895..9f9ca0abb7a 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -639,11 +639,70 @@ class Column:
 return Column(jc)
 
 def __getattr__(self, item: Any) -> "Column":
+"""
+An expression that gets an item at position ``ordinal`` out of a list,
+or gets an item by key out of a dict.
+
+.. versionadded:: 1.3.0
+
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
+Parameters
+----------
+item
+a literal value.
+
+Returns
+-------
+:class:`Column`
+Column representing the item got by key out of a dict.
+
+Examples
+--------
+>>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"])
+>>> df.select(df.d.key).show()
++------+
+|d[key]|
++------+
+| value|
++------+
+"""
 if item.startswith("__"):
 raise AttributeError(item)
 return self[item]
 
 def __getitem__(self, k: Any) -> "Column":
+"""
+An expression that gets an item at position ``ordinal`` out of a list,
+or gets an item by key out of a dict.
+
+.. versionadded:: 1.3.0
+
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
+Parameters
+----------
+k
+a literal value, or a slice object without step.
+
+Returns
+-------
+:class:`Column`
+Column representing the item got by key out of a dict, or 
substrings sliced by
+the given slice object.
+
+Examples
+--------
+>>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"])
+>>> df.select(df.l[slice(1, 3)], df.d['key']).show()
++------------------+------+
+|substring(l, 1, 3)|d[key]|
++------------------+------+
+|               abc| value|
++------------------+------+
+"""
 if isinstance(k, slice):
 if k.step is not None:
 raise ValueError("slice with step is not supported.")
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index a6357a7c137..36547dc64c7 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -2847,6 +2847,28 @@ class DataFrame(PandasM

[spark] branch branch-3.4 updated: [SPARK-42684][SQL] v2 catalog should not allow column default value by default

2023-03-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new ee3daecfe7c [SPARK-42684][SQL] v2 catalog should not allow column 
default value by default
ee3daecfe7c is described below

commit ee3daecfe7c4a2115ffc94f2b85cd9800ea74196
Author: Wenchen Fan 
AuthorDate: Wed Mar 8 20:05:32 2023 +0800

[SPARK-42684][SQL] v2 catalog should not allow column default value by 
default

### What changes were proposed in this pull request?

Following generated columns, column default value should also have a 
catalog capability and v2 catalogs must explicitly declare 
SUPPORT_COLUMN_DEFAULT_VALUE to support it.

### Why are the changes needed?

Column default value needs dedicated handling, and if a catalog simply 
ignores it, the query result can be wrong.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new tests

Closes #40299 from cloud-fan/default.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 69dd20b5e45c7e3533efbfdc1974f59931c1b781)
Signed-off-by: Wenchen Fan 
---
 .../catalog/DelegatingCatalogExtension.java|  6 ++
 .../connector/catalog/TableCatalogCapability.java  | 39 
 .../spark/sql/catalyst/util/GeneratedColumn.scala  |  7 ++-
 .../catalyst/util/ResolveDefaultColumnsUtil.scala  | 69 ++
 .../sql/connector/catalog/CatalogV2Util.scala  |  3 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  | 31 ++
 .../sql/catalyst/catalog/SessionCatalogSuite.scala |  2 +-
 .../connector/catalog/InMemoryTableCatalog.scala   |  5 +-
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 37 
 .../spark/sql/execution/command/tables.scala   |  5 +-
 .../execution/datasources/DataSourceStrategy.scala |  9 +--
 .../datasources/v2/DataSourceV2Strategy.scala  | 20 ---
 .../datasources/v2/V2SessionCatalog.scala  |  8 ++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 34 ++-
 14 files changed, 178 insertions(+), 97 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/DelegatingCatalogExtension.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/DelegatingCatalogExtension.java
index 8bbfe535295..534e1b86eca 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/DelegatingCatalogExtension.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/DelegatingCatalogExtension.java
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.connector.catalog;
 
 import java.util.Map;
+import java.util.Set;
 
 import org.apache.spark.annotation.Evolving;
 import org.apache.spark.sql.catalyst.analysis.*;
@@ -52,6 +53,11 @@ public abstract class DelegatingCatalogExtension implements 
CatalogExtension {
   @Override
   public final void initialize(String name, CaseInsensitiveStringMap options) 
{}
 
+  @Override
+  public Set<TableCatalogCapability> capabilities() {
+return asTableCatalog().capabilities();
+  }
+
   @Override
   public String[] defaultNamespace() {
 return delegate.defaultNamespace();
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java
index 84a2a0f7648..5ccb15ff1f0 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java
@@ -33,16 +33,31 @@ import org.apache.spark.annotation.Evolving;
 public enum TableCatalogCapability {
 
   /**
-  * Signals that the TableCatalog supports defining generated columns upon 
table creation in SQL.
-  * 
-  * Without this capability, any create/replace table statements with a 
generated column defined
-  * in the table schema will throw an exception during analysis.
-  * 
-  * A generated column is defined with syntax: {@code colName colType 
GENERATED ALWAYS AS (expr)}
-  * 
-  * Generation expression are included in the column definition for APIs like
-  * {@link TableCatalog#createTable}.
-  * See {@link Column#generationExpression()}.
-  */
-  SUPPORTS_CREATE_TABLE_WITH_GENERATED_COLUMNS
+   * Signals that the TableCatalog supports defining generated columns upon 
table creation in SQL.
+   * 
+   * Without this capability, any create/replace table statements with a 
generated column defined
+   * in the table schema will throw an exception during analysis.
+   * 
+   * A generated column is defined with syntax: {@code colName colType 
GENERATED ALWAYS AS (expr)}
+   * 
+   * Generation expression are included in the column definition for APIs like
+

[spark] branch master updated: [SPARK-42684][SQL] v2 catalog should not allow column default value by default

2023-03-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 69dd20b5e45 [SPARK-42684][SQL] v2 catalog should not allow column 
default value by default
69dd20b5e45 is described below

commit 69dd20b5e45c7e3533efbfdc1974f59931c1b781
Author: Wenchen Fan 
AuthorDate: Wed Mar 8 20:05:32 2023 +0800

[SPARK-42684][SQL] v2 catalog should not allow column default value by 
default

### What changes were proposed in this pull request?

Following generated columns, column default value should also have a 
catalog capability and v2 catalogs must explicitly declare 
SUPPORT_COLUMN_DEFAULT_VALUE to support it.

### Why are the changes needed?

Column default value needs dedicated handling, and if a catalog simply 
ignores it, the query result can be wrong.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new tests

Closes #40299 from cloud-fan/default.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../catalog/DelegatingCatalogExtension.java|  6 ++
 .../connector/catalog/TableCatalogCapability.java  | 39 
 .../spark/sql/catalyst/util/GeneratedColumn.scala  |  7 ++-
 .../catalyst/util/ResolveDefaultColumnsUtil.scala  | 69 ++
 .../sql/connector/catalog/CatalogV2Util.scala  |  3 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  | 31 ++
 .../sql/catalyst/catalog/SessionCatalogSuite.scala |  2 +-
 .../connector/catalog/InMemoryTableCatalog.scala   |  5 +-
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 37 
 .../spark/sql/execution/command/tables.scala   |  5 +-
 .../execution/datasources/DataSourceStrategy.scala |  9 +--
 .../datasources/v2/DataSourceV2Strategy.scala  | 20 ---
 .../datasources/v2/V2SessionCatalog.scala  |  8 ++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 34 ++-
 14 files changed, 178 insertions(+), 97 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/DelegatingCatalogExtension.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/DelegatingCatalogExtension.java
index 8bbfe535295..534e1b86eca 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/DelegatingCatalogExtension.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/DelegatingCatalogExtension.java
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.connector.catalog;
 
 import java.util.Map;
+import java.util.Set;
 
 import org.apache.spark.annotation.Evolving;
 import org.apache.spark.sql.catalyst.analysis.*;
@@ -52,6 +53,11 @@ public abstract class DelegatingCatalogExtension implements 
CatalogExtension {
   @Override
   public final void initialize(String name, CaseInsensitiveStringMap options) 
{}
 
+  @Override
+  public Set<TableCatalogCapability> capabilities() {
+return asTableCatalog().capabilities();
+  }
+
   @Override
   public String[] defaultNamespace() {
 return delegate.defaultNamespace();
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java
index 84a2a0f7648..5ccb15ff1f0 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java
@@ -33,16 +33,31 @@ import org.apache.spark.annotation.Evolving;
 public enum TableCatalogCapability {
 
   /**
-  * Signals that the TableCatalog supports defining generated columns upon 
table creation in SQL.
-  * 
-  * Without this capability, any create/replace table statements with a 
generated column defined
-  * in the table schema will throw an exception during analysis.
-  * 
-  * A generated column is defined with syntax: {@code colName colType 
GENERATED ALWAYS AS (expr)}
-  * 
-  * Generation expression are included in the column definition for APIs like
-  * {@link TableCatalog#createTable}.
-  * See {@link Column#generationExpression()}.
-  */
-  SUPPORTS_CREATE_TABLE_WITH_GENERATED_COLUMNS
+   * Signals that the TableCatalog supports defining generated columns upon 
table creation in SQL.
+   * 
+   * Without this capability, any create/replace table statements with a 
generated column defined
+   * in the table schema will throw an exception during analysis.
+   * 
+   * A generated column is defined with syntax: {@code colName colType 
GENERATED ALWAYS AS (expr)}
+   * 
+   * Generation expression are included in the column definition for APIs like
+   * {@link TableCatalog#createTable}.
+   * See {@link Column#generationExpression()}.
+   */
+  SUPPORTS_CREATE

[spark] branch branch-3.4 updated: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-08 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new fd97de41206 [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to 
support the remaining jdbc API
fd97de41206 is described below

commit fd97de4120690a5df16b942b0d80770285701f50
Author: Jiaan Geng 
AuthorDate: Wed Mar 8 10:50:00 2023 -0400

[SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the 
remaining jdbc API

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/40252 supported some of the jdbc API by 
reusing the proto msg `DataSource`. The `DataFrameReader` also has another kind of 
jdbc API that is unrelated to loading a data source.

### Why are the changes needed?
This PR adds the new proto msg `PartitionedJDBC` to support the remaining 
jdbc API.
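
For illustration only, the analogous predicate-partitioned read in the classic PySpark `DataFrameReader` API (the URL, table, and credentials below are placeholders, not taken from this patch):

```python
# Each predicate becomes one partition of the resulting DataFrame.
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/shop",
    table="orders",
    predicates=["amount < 100", "amount >= 100"],
    properties={"user": "shop_reader", "password": "secret"},
)
```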

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes #40277 from beliefer/SPARK-42555_followup.

Authored-by: Jiaan Geng 
Signed-off-by: Herman van Hovell 
(cherry picked from commit 39a55121888d2543a6056be65e0c74126a9d3bdf)
Signed-off-by: Herman van Hovell 
---
 .../org/apache/spark/sql/DataFrameReader.scala |  42 +
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  21 +--
 .../main/protobuf/spark/connect/relations.proto|   5 +
 .../read_jdbc_with_predicates.explain  |   1 +
 .../queries/read_jdbc_with_predicates.json |  15 ++
 .../queries/read_jdbc_with_predicates.proto.bin| Bin 0 -> 121 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  |  55 --
 .../sql/connect/ProtoToParsedPlanTestSuite.scala   |   4 +
 python/pyspark/sql/connect/proto/relations_pb2.py  | 190 ++---
 python/pyspark/sql/connect/proto/relations_pb2.pyi |  12 ++
 10 files changed, 222 insertions(+), 123 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 43d6486f124..d5641fb303a 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -250,6 +250,48 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
 jdbc(url, table, connectionProperties)
   }
 
+  /**
+   * Construct a `DataFrame` representing the database table accessible via 
JDBC URL url named
+   * table using connection properties. The `predicates` parameter gives a 
list expressions
+   * suitable for inclusion in WHERE clauses; each one defines one partition 
of the `DataFrame`.
+   *
+   * Don't create too many partitions in parallel on a large cluster; 
otherwise Spark might crash
+   * your external database systems.
+   *
+   * You can find the JDBC-specific option and parameter documentation for 
reading tables via JDBC
+   * in <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param predicates
+   *   Condition in the where clause for each partition.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string 
tag/value. Normally at least
+   *   a "user" and "password" property should be included. "fetchsize" can be 
used to control the
+   *   number of rows per fetch.
+   * @since 3.4.0
+   */
+  def jdbc(
+  url: String,
+  table: String,
+  predicates: Array[String],
+  connectionProperties: Properties): DataFrame = {
+sparkSession.newDataFrame { builder =>
+  val dataSourceBuilder = builder.getReadBuilder.getDataSourceBuilder
+  format("jdbc")
+  dataSourceBuilder.setFormat(source)
+  predicates.foreach(predicate => 
dataSourceBuilder.addPredicates(predicate))
+  this.extraOptions ++= Seq("url" -> url, "dbtable" -> table)
+  val params = extraOptions ++ connectionProperties.asScala
+  params.foreach { case (k, v) =>
+dataSourceBuilder.putOptions(k, v)
+  }
+  builder.build()
+}
+  }
+
   /**
* Loads a JSON file and returns the results as a `DataFrame`.
*
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
old mode 100755
new mode 100644
index 027b7a30246..56c5111912a
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apach

[spark] branch master updated (69dd20b5e45 -> 39a55121888)

2023-03-08 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 69dd20b5e45 [SPARK-42684][SQL] v2 catalog should not allow column 
default value by default
 add 39a55121888 [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to 
support the remaining jdbc API

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/DataFrameReader.scala |  42 +
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  21 +--
 .../main/protobuf/spark/connect/relations.proto|   5 +
 .../read_jdbc_with_predicates.explain  |   1 +
 ...ad_jdbc.json => read_jdbc_with_predicates.json} |   5 +-
 .../queries/read_jdbc_with_predicates.proto.bin| Bin 0 -> 121 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  |  55 --
 .../sql/connect/ProtoToParsedPlanTestSuite.scala   |   4 +
 python/pyspark/sql/connect/proto/relations_pb2.py  | 190 ++---
 python/pyspark/sql/connect/proto/relations_pb2.pyi |  12 ++
 10 files changed, 210 insertions(+), 125 deletions(-)
 mode change 100755 => 100644 
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/explain-results/read_jdbc_with_predicates.explain
 copy 
connector/connect/common/src/test/resources/query-tests/queries/{read_jdbc.json 
=> read_jdbc_with_predicates.json} (65%)
 create mode 100644 
connector/connect/common/src/test/resources/query-tests/queries/read_jdbc_with_predicates.proto.bin





[spark] branch branch-3.4 updated: [SPARK-42709][PYTHON] Remove the assumption of `__file__` being available

2023-03-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 5aae9707450 [SPARK-42709][PYTHON] Remove the assumption of `__file__` 
being available
5aae9707450 is described below

commit 5aae970745007e2b5fcbcf6491ed3a72e93f4763
Author: Hyukjin Kwon 
AuthorDate: Wed Mar 8 09:59:33 2023 -0800

[SPARK-42709][PYTHON] Remove the assumption of `__file__` being available

### What changes were proposed in this pull request?

This PR proposes to add a check for `__file__` attributes.

### Why are the changes needed?

`__file__` might not be available everywhere. See also 
https://github.com/scikit-learn/scikit-learn/issues/20081
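
The guarded pattern (mirroring the `find_spark_home.py` hunk below) looks roughly like this:

```python
import os

paths = ["../"]  # relative fallback that never depends on __file__
# Only derive paths from __file__ when the attribute actually exists; it can
# be missing, e.g. in embedded interpreters.
if "__file__" in globals():
    paths.append(os.path.dirname(os.path.realpath(__file__)))
```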

### Does this PR introduce _any_ user-facing change?

If a user's Python environment does not provide `__file__`, they can now use 
PySpark in that environment too.

### How was this patch tested?

Manually tested.

Closes #40328 from HyukjinKwon/SPARK-42709.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit f95fc19d9491fa79a3f34cfb721b4919c9b3bb0f)
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/find_spark_home.py | 10 +++---
 python/pyspark/sql/connect/catalog.py |  2 +-
 python/pyspark/sql/connect/client.py  |  2 +-
 python/pyspark/sql/connect/column.py  |  2 +-
 python/pyspark/sql/connect/conversion.py  |  2 +-
 python/pyspark/sql/connect/dataframe.py   |  2 +-
 python/pyspark/sql/connect/expressions.py |  2 +-
 python/pyspark/sql/connect/functions.py   |  2 +-
 python/pyspark/sql/connect/group.py   |  2 +-
 python/pyspark/sql/connect/plan.py|  2 +-
 python/pyspark/sql/connect/readwriter.py  |  2 +-
 python/pyspark/sql/connect/session.py |  2 +-
 python/pyspark/sql/connect/types.py   |  2 +-
 python/pyspark/sql/connect/udf.py |  2 +-
 python/pyspark/sql/connect/utils.py   |  4 ++--
 python/pyspark/sql/connect/window.py  |  2 +-
 python/pyspark/worker.py  | 19 ++-
 17 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/python/pyspark/find_spark_home.py 
b/python/pyspark/find_spark_home.py
index 09f4551ea5f..a2226f8385e 100755
--- a/python/pyspark/find_spark_home.py
+++ b/python/pyspark/find_spark_home.py
@@ -42,11 +42,15 @@ def _find_spark_home():
 spark_dist_dir = "spark-distribution"
 paths = [
 "../",  # When we're in spark/python.
-# Two case belows are valid when the current script is called as a 
library.
-os.path.join(os.path.dirname(os.path.realpath(__file__)), 
spark_dist_dir),
-os.path.dirname(os.path.realpath(__file__)),
 ]
 
+if "__file__" in globals():
+paths += [
+# Two case belows are valid when the current script is called as a 
library.
+os.path.join(os.path.dirname(os.path.realpath(__file__)), 
spark_dist_dir),
+os.path.dirname(os.path.realpath(__file__)),
+]
+
 # Add the path of the PySpark module if it exists
 import_error_raised = False
 from importlib.util import find_spec
diff --git a/python/pyspark/sql/connect/catalog.py 
b/python/pyspark/sql/connect/catalog.py
index f2bbae344f2..261f87b4cc6 100644
--- a/python/pyspark/sql/connect/catalog.py
+++ b/python/pyspark/sql/connect/catalog.py
@@ -16,7 +16,7 @@
 #
 from pyspark.sql.connect.utils import check_dependencies
 
-check_dependencies(__name__, __file__)
+check_dependencies(__name__)
 
 from typing import Any, Callable, List, Optional, TYPE_CHECKING
 
diff --git a/python/pyspark/sql/connect/client.py 
b/python/pyspark/sql/connect/client.py
index 6334036fca4..baa6d641422 100644
--- a/python/pyspark/sql/connect/client.py
+++ b/python/pyspark/sql/connect/client.py
@@ -23,7 +23,7 @@ import string
 
 from pyspark.sql.connect.utils import check_dependencies
 
-check_dependencies(__name__, __file__)
+check_dependencies(__name__)
 
 import logging
 import os
diff --git a/python/pyspark/sql/connect/column.py 
b/python/pyspark/sql/connect/column.py
index bc8b60beb97..d2be32b905e 100644
--- a/python/pyspark/sql/connect/column.py
+++ b/python/pyspark/sql/connect/column.py
@@ -16,7 +16,7 @@
 #
 from pyspark.sql.connect.utils import check_dependencies
 
-check_dependencies(__name__, __file__)
+check_dependencies(__name__)
 
 import datetime
 import decimal
diff --git a/python/pyspark/sql/connect/conversion.py 
b/python/pyspark/sql/connect/conversion.py
index 7b452de48f6..2b16fc7766d 100644
--- a/python/pyspark/sql/connect/conversion.py
+++ b/python/pyspark/sql/connect/conversion.py
@@ -16,7 +16,7 @@
 #
 from pyspark.sql.connect.utils import check_dependencies
 
-check_dependencies(__name__, __file__)
+check_dependencies(__name__)
 
 import array
 import datetime
diff --git a/python/pyspark/sql

[spark] branch master updated (39a55121888 -> f95fc19d949)

2023-03-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 39a55121888 [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to 
support the remaining jdbc API
 add f95fc19d949 [SPARK-42709][PYTHON] Remove the assumption of `__file__` 
being available

No new revisions were added by this update.

Summary of changes:
 python/pyspark/find_spark_home.py | 10 +++---
 python/pyspark/sql/connect/catalog.py |  2 +-
 python/pyspark/sql/connect/client.py  |  2 +-
 python/pyspark/sql/connect/column.py  |  2 +-
 python/pyspark/sql/connect/conversion.py  |  2 +-
 python/pyspark/sql/connect/dataframe.py   |  2 +-
 python/pyspark/sql/connect/expressions.py |  2 +-
 python/pyspark/sql/connect/functions.py   |  2 +-
 python/pyspark/sql/connect/group.py   |  2 +-
 python/pyspark/sql/connect/plan.py|  2 +-
 python/pyspark/sql/connect/readwriter.py  |  2 +-
 python/pyspark/sql/connect/session.py |  2 +-
 python/pyspark/sql/connect/types.py   |  2 +-
 python/pyspark/sql/connect/udf.py |  2 +-
 python/pyspark/sql/connect/utils.py   |  4 ++--
 python/pyspark/sql/connect/window.py  |  2 +-
 python/pyspark/worker.py  | 19 ++-
 17 files changed, 33 insertions(+), 28 deletions(-)





[spark] branch master updated: [SPARK-42480][SQL] Improve the performance of drop partitions

2023-03-08 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 153ace7075e [SPARK-42480][SQL] Improve the performance of drop 
partitions
153ace7075e is described below

commit 153ace7075ef0fca7804d1b4789eb33e9e0e6ecf
Author: wecharyu 
AuthorDate: Wed Mar 8 16:30:01 2023 -0800

[SPARK-42480][SQL] Improve the performance of drop partitions

### What changes were proposed in this pull request?
1. Change to get matching partition names rather than partition objects 
when dropping partitions

### Why are the changes needed?
1. Partition names are enough to drop partitions
2. It can reduce the time overhead and driver memory overhead.

### Does this PR introduce _any_ user-facing change?
Yes, we have added a new SQL conf to enable this feature: 
`spark.sql.hive.dropPartitionByName.enabled`
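
A hedged usage sketch (the table and partition values are illustrative, and Hive support must be enabled):

```python
spark.conf.set("spark.sql.hive.dropPartitionByName.enabled", "true")
spark.sql("ALTER TABLE sales DROP PARTITION (region='reg1')")
```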

### How was this patch tested?
Add new tests.

Closes #40069 from wecharyu/SPARK-42480.

Authored-by: wecharyu 
Signed-off-by: Chao Sun 
---
 .../org/apache/spark/sql/internal/SQLConf.scala| 10 +
 .../command/AlterTableDropPartitionSuiteBase.scala | 18 
 .../org/apache/spark/sql/hive/HiveUtils.scala  | 13 +++
 .../spark/sql/hive/client/HiveClientImpl.scala | 21 --
 .../command/AlterTableDropPartitionSuite.scala | 25 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index c09d40802f0..b8b8e84a56d 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1227,6 +1227,14 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val HIVE_METASTORE_DROP_PARTITION_BY_NAME =
+buildConf("spark.sql.hive.dropPartitionByName.enabled")
+  .doc("When true, Spark will get partition name rather than partition 
object " +
+   "to drop partition, which can improve the performance of drop 
partition.")
+  .version("3.4.0")
+  .booleanConf
+  .createWithDefault(false)
+
   val HIVE_METASTORE_PARTITION_PRUNING =
 buildConf("spark.sql.hive.metastorePartitionPruning")
   .doc("When true, some predicates will be pushed down into the Hive 
metastore so that " +
@@ -4483,6 +4491,8 @@ class SQLConf extends Serializable with Logging {
 
   def verifyPartitionPath: Boolean = getConf(HIVE_VERIFY_PARTITION_PATH)
 
+  def metastoreDropPartitionsByName: Boolean = 
getConf(HIVE_METASTORE_DROP_PARTITION_BY_NAME)
+
   def metastorePartitionPruning: Boolean = 
getConf(HIVE_METASTORE_PARTITION_PRUNING)
 
   def metastorePartitionPruningInSetThreshold: Int =
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
index 3f15533ca5f..eaf305414f1 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
@@ -284,4 +284,22 @@ trait AlterTableDropPartitionSuiteBase extends QueryTest 
with DDLCommandTestUtil
   }
 }
   }
+
+  test("SPARK-42480: drop partition when dropPartitionByName enabled") {
+withSQLConf(SQLConf.HIVE_METASTORE_DROP_PARTITION_BY_NAME.key -> "true") {
+  withNamespaceAndTable("ns", "tbl") { t =>
+sql(s"CREATE TABLE $t(name STRING, age INT) USING PARQUET PARTITIONED 
BY (region STRING)")
+sql(s"ALTER TABLE $t ADD PARTITION (region='=reg1') LOCATION 'loc1'")
+checkPartitions(t, Map("region" -> "=reg1"))
+sql(s"ALTER TABLE $t PARTITION (region='=reg1') RENAME TO PARTITION 
(region='=%reg1')")
+checkPartitions(t, Map("region" -> "=%reg1"))
+sql(s"ALTER TABLE $t DROP PARTITION (region='=%reg1')")
+checkPartitions(t)
+sql(s"ALTER TABLE $t ADD PARTITION (region='reg?2') LOCATION 'loc2'")
+checkPartitions(t, Map("region" -> "reg?2"))
+sql(s"ALTER TABLE $t DROP PARTITION (region='reg?2')")
+checkPartitions(t)
+  }
+}
+  }
 }
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
index 4637a4a0179..a01246520f3 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
@@ -27,6 +27,8 @@ import scala.collection.mutable.HashMap
 import scala.util.Try
 
 import org.apache.hadoop.conf.Config

[spark] branch branch-3.4 updated: [SPARK-42480][SQL] Improve the performance of drop partitions

2023-03-08 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 5eb4edf53c4 [SPARK-42480][SQL] Improve the performance of drop 
partitions
5eb4edf53c4 is described below

commit 5eb4edf53c4bdda9d9ce8847048d195fb2b5c6a2
Author: wecharyu 
AuthorDate: Wed Mar 8 16:30:01 2023 -0800

[SPARK-42480][SQL] Improve the performance of drop partitions

### What changes were proposed in this pull request?
1. Change to get matching partition names rather than partition objects 
when dropping partitions

### Why are the changes needed?
1. Partition names are enough to drop partitions
2. It can reduce the time overhead and driver memory overhead.

### Does this PR introduce _any_ user-facing change?
Yes, we have added a new SQL conf to enable this feature: 
`spark.sql.hive.dropPartitionByName.enabled`

### How was this patch tested?
Add new tests.

Closes #40069 from wecharyu/SPARK-42480.

Authored-by: wecharyu 
Signed-off-by: Chao Sun 
---
 .../org/apache/spark/sql/internal/SQLConf.scala| 10 +
 .../command/AlterTableDropPartitionSuiteBase.scala | 18 
 .../org/apache/spark/sql/hive/HiveUtils.scala  | 13 +++
 .../spark/sql/hive/client/HiveClientImpl.scala | 21 --
 .../command/AlterTableDropPartitionSuite.scala | 25 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 21bf571c494..5a93fdde304 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1227,6 +1227,14 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val HIVE_METASTORE_DROP_PARTITION_BY_NAME =
+buildConf("spark.sql.hive.dropPartitionByName.enabled")
+  .doc("When true, Spark will get partition name rather than partition 
object " +
+   "to drop partition, which can improve the performance of drop 
partition.")
+  .version("3.4.0")
+  .booleanConf
+  .createWithDefault(false)
+
   val HIVE_METASTORE_PARTITION_PRUNING =
 buildConf("spark.sql.hive.metastorePartitionPruning")
   .doc("When true, some predicates will be pushed down into the Hive 
metastore so that " +
@@ -4472,6 +4480,8 @@ class SQLConf extends Serializable with Logging {
 
   def verifyPartitionPath: Boolean = getConf(HIVE_VERIFY_PARTITION_PATH)
 
+  def metastoreDropPartitionsByName: Boolean = 
getConf(HIVE_METASTORE_DROP_PARTITION_BY_NAME)
+
   def metastorePartitionPruning: Boolean = 
getConf(HIVE_METASTORE_PARTITION_PRUNING)
 
   def metastorePartitionPruningInSetThreshold: Int =
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
index 3f15533ca5f..eaf305414f1 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
@@ -284,4 +284,22 @@ trait AlterTableDropPartitionSuiteBase extends QueryTest 
with DDLCommandTestUtil
   }
 }
   }
+
+  test("SPARK-42480: drop partition when dropPartitionByName enabled") {
+withSQLConf(SQLConf.HIVE_METASTORE_DROP_PARTITION_BY_NAME.key -> "true") {
+  withNamespaceAndTable("ns", "tbl") { t =>
+sql(s"CREATE TABLE $t(name STRING, age INT) USING PARQUET PARTITIONED 
BY (region STRING)")
+sql(s"ALTER TABLE $t ADD PARTITION (region='=reg1') LOCATION 'loc1'")
+checkPartitions(t, Map("region" -> "=reg1"))
+sql(s"ALTER TABLE $t PARTITION (region='=reg1') RENAME TO PARTITION 
(region='=%reg1')")
+checkPartitions(t, Map("region" -> "=%reg1"))
+sql(s"ALTER TABLE $t DROP PARTITION (region='=%reg1')")
+checkPartitions(t)
+sql(s"ALTER TABLE $t ADD PARTITION (region='reg?2') LOCATION 'loc2'")
+checkPartitions(t, Map("region" -> "reg?2"))
+sql(s"ALTER TABLE $t DROP PARTITION (region='reg?2')")
+checkPartitions(t)
+  }
+}
+  }
 }
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
index fe9bdef3d0e..0b59011f4e2 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
@@ -28,6 +28,8 @@ import scala.util.Try
 
 import org.apache.commons.lang3.{JavaVersion, SystemUtils}
 import or

[spark] branch master updated: [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache the schema

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a77bb37b411 [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() 
should not cache the schema
a77bb37b411 is described below

commit a77bb37b4112543fcd77a7f6091e465daeb3f8ae
Author: Rui Wang 
AuthorDate: Thu Mar 9 09:41:08 2023 +0900

[SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache 
the schema

### What changes were proposed in this pull request?

As of now, the Connect Python client caches the schema when calling `def 
schema()`. However, this might cause a stale data issue. For example:
```
1. Create table
2. table.schema
3. drop table and recreate the table with different schema
4. table.schema // now this is incorrect
```
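
Sketched concretely (a Spark Connect session is assumed as `spark`; the table name is illustrative):

```python
spark.sql("CREATE TABLE t (id INT) USING parquet")
df = spark.table("t")
print(df.schema)   # struct with a single int column

spark.sql("DROP TABLE t")
spark.sql("CREATE TABLE t (id INT, name STRING) USING parquet")
print(df.schema)   # with a cached schema this could still show the old
                   # struct; after this change it is re-fetched per call
```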

### Why are the changes needed?

Fix the behavior when the cached schema could be stale.

### Does this PR introduce _any_ user-facing change?

This is actually a fix: users can now always see the most up-to-date 
schema.

### How was this patch tested?

Existing UT

Closes #40343 from amaliujia/disable_cache.

Authored-by: Rui Wang 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/dataframe.py | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index f8b92cdc7ae..504b83d1165 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1367,17 +1367,13 @@ class DataFrame:
 
 @property
 def schema(self) -> StructType:
-if self._schema is None:
-if self._plan is not None:
-query = self._plan.to_proto(self._session.client)
-if self._session is None:
-raise Exception("Cannot analyze without SparkSession.")
-self._schema = self._session.client.schema(query)
-return self._schema
-else:
-raise Exception("Empty plan.")
+if self._plan is not None:
+query = self._plan.to_proto(self._session.client)
+if self._session is None:
+raise Exception("Cannot analyze without SparkSession.")
+return self._session.client.schema(query)
 else:
-return self._schema
+raise Exception("Empty plan.")
 
 schema.__doc__ = PySparkDataFrame.schema.__doc__
 





[spark] branch branch-3.4 updated: [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache the schema

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 8655dfe66de [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() 
should not cache the schema
8655dfe66de is described below

commit 8655dfe66de6d24843afd733ddfb8e92d220d76d
Author: Rui Wang 
AuthorDate: Thu Mar 9 09:41:08 2023 +0900

[SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache 
the schema

### What changes were proposed in this pull request?

As of now, the Connect Python client caches the schema when calling `def 
schema()`. However, this might cause a stale data issue. For example:
```
1. Create table
2. table.schema
3. drop table and recreate the table with different schema
4. table.schema // now this is incorrect
```

### Why are the changes needed?

Fix the behavior when the cached schema could be stale.

### Does this PR introduce _any_ user-facing change?

This is actually a fix: users can now always see the most up-to-date 
schema.

### How was this patch tested?

Existing UT

Closes #40343 from amaliujia/disable_cache.

Authored-by: Rui Wang 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit a77bb37b4112543fcd77a7f6091e465daeb3f8ae)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/dataframe.py | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index f8b92cdc7ae..504b83d1165 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1367,17 +1367,13 @@ class DataFrame:
 
 @property
 def schema(self) -> StructType:
-if self._schema is None:
-if self._plan is not None:
-query = self._plan.to_proto(self._session.client)
-if self._session is None:
-raise Exception("Cannot analyze without SparkSession.")
-self._schema = self._session.client.schema(query)
-return self._schema
-else:
-raise Exception("Empty plan.")
+if self._plan is not None:
+query = self._plan.to_proto(self._session.client)
+if self._session is None:
+raise Exception("Cannot analyze without SparkSession.")
+return self._session.client.schema(query)
 else:
-return self._schema
+raise Exception("Empty plan.")
 
 schema.__doc__ = PySparkDataFrame.schema.__doc__
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-42656][CONNECT][FOLLOWUP] Fix the spark-connect script

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b5243d7f7f9 [SPARK-42656][CONNECT][FOLLOWUP] Fix the spark-connect 
script
b5243d7f7f9 is described below

commit b5243d7f7f9ede78a711eb168cf951f4bde7a8fa
Author: Zhen Li 
AuthorDate: Thu Mar 9 10:29:47 2023 +0900

[SPARK-42656][CONNECT][FOLLOWUP] Fix the spark-connect script

### What changes were proposed in this pull request?
The spark-connect script is broken as it needs a jar at the end.
Also ensured that when Scala 2.13 is set, all commands in the scripts run with 
`-Pscala-2.13`.

Example usage:
Start spark connect with default settings:
* `./connector/connect/bin/spark-connect-shell`
* or `./connector/connect/bin/spark-connect` (Enter "q"  to exit 
the program)

Start Scala client with default settings: 
`./connector/connect/bin/spark-connect-scala-client`

Start spark connect with extra configs:
* `./connector/connect/bin/spark-connect-shell --conf 
spark.connect.grpc.binding.port=`
* or `./connector/connect/bin/spark-connect --conf 
spark.connect.grpc.binding.port=`

Start Scala client with a connection string:
```
export SPARK_REMOTE="sc://localhost:/"
./connector/connect/bin/spark-connect-scala-client
```

### Why are the changes needed?
Bug fix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested on 2.12 and 2.13 for all the scripts changed.

Test example with expected results:
`./connector/connect/bin/spark-connect-shell` :
https://user-images.githubusercontent.com/4190164/223863343-d5d159d9-da7c-47c7-b55a-a2854c5f5d76.png

Verify the spark connect server is started at the correct port, e.g.
```
>Telnet localhost 15002
Trying ::1...
Connected to localhost.
Escape character is '^]'.
```

`./connector/connect/bin/spark-connect`:
https://user-images.githubusercontent.com/4190164/223863099-41195599-c49d-4db4-a1e2-e129a649cd81.png
The server has started successfully once the last line of output appears.

`./connector/connect/bin/spark-connect-scala-client`:
https://user-images.githubusercontent.com/4190164/223862992-c8a3a36a-9f69-40b8-b82e-5dab85ed14ce.png
Verify the client can run some simple queries.

Closes #40344 from zhenlineo/fix-scripts.

Authored-by: Zhen Li 
Signed-off-by: Hyukjin Kwon 
---
 connector/connect/bin/spark-connect  | 11 +--
 connector/connect/bin/spark-connect-scala-client | 19 ++-
 connector/connect/bin/spark-connect-shell| 10 +++---
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/connector/connect/bin/spark-connect 
b/connector/connect/bin/spark-connect
index 62d0d36b441..772a88a04f3 100755
--- a/connector/connect/bin/spark-connect
+++ b/connector/connect/bin/spark-connect
@@ -26,7 +26,14 @@ FWDIR="$(cd "`dirname "$0"`"/../../..; pwd)"
 cd "$FWDIR"
 export SPARK_HOME=$FWDIR
 
+# Determine the Scala version used in Spark
+SCALA_BINARY_VER=`grep "scala.binary.version" "${SPARK_HOME}/pom.xml" | head 
-n1 | awk -F '[<>]' '{print $3}'`
+SCALA_ARG="-Pscala-${SCALA_BINARY_VER}"
+
 # Build the jars needed for spark submit and spark connect
-build/sbt -Phive -Pconnect package
+build/sbt "${SCALA_ARG}" -Phive -Pconnect package
+
+# This jar is already in the classpath, but the submit commands wants a jar as 
the input.
+CONNECT_JAR=`ls 
"${SPARK_HOME}"/assembly/target/scala-"${SCALA_BINARY_VER}"/jars/spark-connect_*.jar
 | paste -sd ',' -`
 
-exec "${SPARK_HOME}"/bin/spark-submit --class 
org.apache.spark.sql.connect.SimpleSparkConnectService "$@"
\ No newline at end of file
+exec "${SPARK_HOME}"/bin/spark-submit "$@" --class 
org.apache.spark.sql.connect.SimpleSparkConnectService "$CONNECT_JAR"
diff --git a/connector/connect/bin/spark-connect-scala-client 
b/connector/connect/bin/spark-connect-scala-client
index 902091a74de..8c5e687ef24 100755
--- a/connector/connect/bin/spark-connect-scala-client
+++ b/connector/connect/bin/spark-connect-scala-client
@@ -34,17 +34,18 @@ FWDIR="$(cd "`dirname "$0"`"/../../..; pwd)"
 cd "$FWDIR"
 export SPARK_HOME=$FWDIR
 
-# Build the jars needed for spark connect JVM client
-build/sbt "sql/package;connect-client-jvm/assembly"
-
-CONNECT_CLASSPATH="$(build/sbt -DcopyDependencies=false "export 
connect-client-jvm/fullClasspath" | grep jar | tail -n1)"
-SQL_CLASSPATH="$(build/sbt -DcopyDependencies=false "export sql/fullClasspath" 
| grep jar | tail -n1)"
-
-INIT_SCRIPT="${SPARK_HOME}"/connector/connect/bin/spark-connect-scala-client.sc
-
 # Determine the Scala version used in Spark
 SCALA_BINARY_VER=`grep "scala.binary.version" "${SPARK_HOME}/pom.xml" | head 
-n1 |

[spark] branch branch-3.4 updated: [SPARK-42656][CONNECT][FOLLOWUP] Fix the spark-connect script

2023-03-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new fc29b07a31d [SPARK-42656][CONNECT][FOLLOWUP] Fix the spark-connect 
script
fc29b07a31d is described below

commit fc29b07a31da4be9142b6a0a7cdff5e72ab4edb7
Author: Zhen Li 
AuthorDate: Thu Mar 9 10:29:47 2023 +0900

[SPARK-42656][CONNECT][FOLLOWUP] Fix the spark-connect script

### What changes were proposed in this pull request?
The spark-connect script is broken as it needs a jar at the end.
Also ensured that when Scala 2.13 is set, all commands in the scripts run with 
`-Pscala-2.13`.

Example usage:
Start spark connect with default settings:
* `./connector/connect/bin/spark-connect-shell`
* or `./connector/connect/bin/spark-connect` (Enter "q"  to exit 
the program)

Start Scala client with default settings: 
`./connector/connect/bin/spark-connect-scala-client`

Start spark connect with extra configs:
* `./connector/connect/bin/spark-connect-shell --conf 
spark.connect.grpc.binding.port=`
* or `./connector/connect/bin/spark-connect --conf 
spark.connect.grpc.binding.port=`

Start Scala client with a connection string:
```
export SPARK_REMOTE="sc://localhost:/"
./connector/connect/bin/spark-connect-scala-client
```

### Why are the changes needed?
Bug fix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested on 2.12 and 2.13 for all the scripts changed.

Test example with expected results:
`./connector/connect/bin/spark-connect-shell` :
https://user-images.githubusercontent.com/4190164/223863343-d5d159d9-da7c-47c7-b55a-a2854c5f5d76.png

Verify the spark connect server is started at the correct port, e.g.
```
>Telnet localhost 15002
Trying ::1...
Connected to localhost.
Escape character is '^]'.
```

`./connector/connect/bin/spark-connect`:
https://user-images.githubusercontent.com/4190164/223863099-41195599-c49d-4db4-a1e2-e129a649cd81.png
The server has started successfully once the last line of output appears.

`./connector/connect/bin/spark-connect-scala-client`:
https://user-images.githubusercontent.com/4190164/223862992-c8a3a36a-9f69-40b8-b82e-5dab85ed14ce.png
Verify the client can run some simple queries.

Closes #40344 from zhenlineo/fix-scripts.

Authored-by: Zhen Li 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit b5243d7f7f9ede78a711eb168cf951f4bde7a8fa)
Signed-off-by: Hyukjin Kwon 
---
 connector/connect/bin/spark-connect  | 11 +--
 connector/connect/bin/spark-connect-scala-client | 19 ++-
 connector/connect/bin/spark-connect-shell| 10 +++---
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/connector/connect/bin/spark-connect 
b/connector/connect/bin/spark-connect
index 62d0d36b441..772a88a04f3 100755
--- a/connector/connect/bin/spark-connect
+++ b/connector/connect/bin/spark-connect
@@ -26,7 +26,14 @@ FWDIR="$(cd "`dirname "$0"`"/../../..; pwd)"
 cd "$FWDIR"
 export SPARK_HOME=$FWDIR
 
+# Determine the Scala version used in Spark
+SCALA_BINARY_VER=`grep "scala.binary.version" "${SPARK_HOME}/pom.xml" | head 
-n1 | awk -F '[<>]' '{print $3}'`
+SCALA_ARG="-Pscala-${SCALA_BINARY_VER}"
+
 # Build the jars needed for spark submit and spark connect
-build/sbt -Phive -Pconnect package
+build/sbt "${SCALA_ARG}" -Phive -Pconnect package
+
+# This jar is already in the classpath, but the submit commands wants a jar as 
the input.
+CONNECT_JAR=`ls 
"${SPARK_HOME}"/assembly/target/scala-"${SCALA_BINARY_VER}"/jars/spark-connect_*.jar
 | paste -sd ',' -`
 
-exec "${SPARK_HOME}"/bin/spark-submit --class 
org.apache.spark.sql.connect.SimpleSparkConnectService "$@"
\ No newline at end of file
+exec "${SPARK_HOME}"/bin/spark-submit "$@" --class 
org.apache.spark.sql.connect.SimpleSparkConnectService "$CONNECT_JAR"
diff --git a/connector/connect/bin/spark-connect-scala-client 
b/connector/connect/bin/spark-connect-scala-client
index 902091a74de..8c5e687ef24 100755
--- a/connector/connect/bin/spark-connect-scala-client
+++ b/connector/connect/bin/spark-connect-scala-client
@@ -34,17 +34,18 @@ FWDIR="$(cd "`dirname "$0"`"/../../..; pwd)"
 cd "$FWDIR"
 export SPARK_HOME=$FWDIR
 
-# Build the jars needed for spark connect JVM client
-build/sbt "sql/package;connect-client-jvm/assembly"
-
-CONNECT_CLASSPATH="$(build/sbt -DcopyDependencies=false "export 
connect-client-jvm/fullClasspath" | grep jar | tail -n1)"
-SQL_CLASSPATH="$(build/sbt -DcopyDependencies=false "export sql/fullClasspath" 
| grep jar | tail -n1)"
-
-INIT_SCRIPT="${SPARK_HOME}"/connector/connect/bin/spark-connect-scala-client.sc
-
 # Determine th

[spark] branch master updated: [SPARK-42723][SQL] Support parser data type json "timestamp_ltz" as TimestampType

2023-03-08 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c4756edde5d [SPARK-42723][SQL] Support parser data type json 
"timestamp_ltz" as TimestampType
c4756edde5d is described below

commit c4756edde5d71fa6fcb071a071c93c4ca4b5a37a
Author: Gengliang Wang 
AuthorDate: Wed Mar 8 19:30:09 2023 -0800

[SPARK-42723][SQL] Support parser data type json "timestamp_ltz" as 
TimestampType

### What changes were proposed in this pull request?

Support parsing `timestamp_ltz` as `TimestampType` in a schema JSON string.
It also adds tests for parsing both the JSON and DDL forms of "timestamp_ltz" and 
"timestamp_ntz".

### Why are the changes needed?

`timestamp_ltz` becomes an alias for `TimestampType` in Spark 3.4.
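
As a rough illustration of the alias on the user-facing side (a sketch assuming Spark 3.4+; the column name and value are made up — the fix itself targets the internal `DataType.fromJson` path shown in the diff below):
```python
import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "timestamp_ltz" is accepted wherever a DDL-formatted schema string is allowed.
df = spark.createDataFrame(
    [(datetime.datetime(2023, 3, 9, 5, 34, 0),)],
    schema="ts timestamp_ltz",
)
df.printSchema()  # ts: timestamp (nullable = true)
```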

### Does this PR introduce _any_ user-facing change?

No, the new keyword is not released yet.
### How was this patch tested?

New UT

Closes #40345 from gengliangwang/parseJson.

Authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
---
 .../src/main/scala/org/apache/spark/sql/types/DataType.scala | 1 +
 .../test/scala/org/apache/spark/sql/types/DataTypeSuite.scala| 9 +
 2 files changed, 10 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala
index e4e548127d3..08d6f312066 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala
@@ -201,6 +201,7 @@ object DataType {
   case VARCHAR_TYPE(length) => VarcharType(length.toInt)
   // For backwards compatibility, previously the type name of NullType is 
"null"
   case "null" => NullType
+  case "timestamp_ltz" => TimestampType
   case other => otherTypes.getOrElse(
 other,
 throw new IllegalArgumentException(
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
index 5d71732f7b0..2a487133b48 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
@@ -191,6 +191,12 @@ class DataTypeSuite extends SparkFunSuite {
 assert(DataType.fromJson("\"null\"") == NullType)
   }
 
+  test("SPARK-42723: Parse timestamp_ltz as TimestampType") {
+assert(DataType.fromJson("\"timestamp_ltz\"") == TimestampType)
+val expectedStructType = StructType(Seq(StructField("ts", TimestampType)))
+assert(DataType.fromDDL("ts timestamp_ltz") == expectedStructType)
+  }
+
   def checkDataTypeFromJson(dataType: DataType): Unit = {
 test(s"from Json - $dataType") {
   assert(DataType.fromJson(dataType.json) === dataType)
@@ -241,6 +247,9 @@ class DataTypeSuite extends SparkFunSuite {
   checkDataTypeFromJson(TimestampType)
   checkDataTypeFromDDL(TimestampType)
 
+  checkDataTypeFromJson(TimestampNTZType)
+  checkDataTypeFromDDL(TimestampNTZType)
+
   checkDataTypeFromJson(StringType)
   checkDataTypeFromDDL(StringType)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.4 updated: [SPARK-42723][SQL] Support parser data type json "timestamp_ltz" as TimestampType

2023-03-08 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 006e838edeb [SPARK-42723][SQL] Support parser data type json 
"timestamp_ltz" as TimestampType
006e838edeb is described below

commit 006e838edebc62ba8a4398dde1b5720b72227c41
Author: Gengliang Wang 
AuthorDate: Wed Mar 8 19:30:09 2023 -0800

[SPARK-42723][SQL] Support parser data type json "timestamp_ltz" as 
TimestampType

### What changes were proposed in this pull request?

Support parsing `timestamp_ltz` as `TimestampType` in a schema JSON string.
It also adds tests for parsing both the JSON and DDL forms of "timestamp_ltz" and 
"timestamp_ntz".

### Why are the changes needed?

`timestamp_ltz` becomes an alias for `TimestampType` in Spark 3.4.

### Does this PR introduce _any_ user-facing change?

No, the new keyword is not released yet.
### How was this patch tested?

New UT

Closes #40345 from gengliangwang/parseJson.

Authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
(cherry picked from commit c4756edde5d71fa6fcb071a071c93c4ca4b5a37a)
Signed-off-by: Gengliang Wang 
---
 .../src/main/scala/org/apache/spark/sql/types/DataType.scala | 1 +
 .../test/scala/org/apache/spark/sql/types/DataTypeSuite.scala| 9 +
 2 files changed, 10 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala
index e4e548127d3..08d6f312066 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala
@@ -201,6 +201,7 @@ object DataType {
   case VARCHAR_TYPE(length) => VarcharType(length.toInt)
   // For backwards compatibility, previously the type name of NullType is 
"null"
   case "null" => NullType
+  case "timestamp_ltz" => TimestampType
   case other => otherTypes.getOrElse(
 other,
 throw new IllegalArgumentException(
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
index 5d71732f7b0..2a487133b48 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
@@ -191,6 +191,12 @@ class DataTypeSuite extends SparkFunSuite {
 assert(DataType.fromJson("\"null\"") == NullType)
   }
 
+  test("SPARK-42723: Parse timestamp_ltz as TimestampType") {
+assert(DataType.fromJson("\"timestamp_ltz\"") == TimestampType)
+val expectedStructType = StructType(Seq(StructField("ts", TimestampType)))
+assert(DataType.fromDDL("ts timestamp_ltz") == expectedStructType)
+  }
+
   def checkDataTypeFromJson(dataType: DataType): Unit = {
 test(s"from Json - $dataType") {
   assert(DataType.fromJson(dataType.json) === dataType)
@@ -241,6 +247,9 @@ class DataTypeSuite extends SparkFunSuite {
   checkDataTypeFromJson(TimestampType)
   checkDataTypeFromDDL(TimestampType)
 
+  checkDataTypeFromJson(TimestampNTZType)
+  checkDataTypeFromDDL(TimestampNTZType)
+
   checkDataTypeFromJson(StringType)
   checkDataTypeFromDDL(StringType)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d3d8fdc2882 [SPARK-42697][WEBUI] Fix /api/v1/applications to return 
total uptime instead of 0 for the duration field
d3d8fdc2882 is described below

commit d3d8fdc2882f5c084897ca9b2af9a063358f3a21
Author: Kent Yao 
AuthorDate: Thu Mar 9 13:33:43 2023 +0800

[SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime 
instead of 0 for the duration field

### What changes were proposed in this pull request?

Fix /api/v1/applications to return total uptime instead of 0 for duration

### Why are the changes needed?

Fix REST API OneApplicationResource

### Does this PR introduce _any_ user-facing change?

yes, /api/v1/applications will return the total uptime instead of 0 for the 
duration

### How was this patch tested?

Built and ran locally; sample response:

```json
[ {
  "id" : "local-1678183638394",
  "name" : "SparkSQL::10.221.102.180",
  "attempts" : [ {
"startTime" : "2023-03-07T10:07:17.754GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2023-03-07T10:07:17.754GMT",
"duration" : 20317,
"sparkUser" : "kentyao",
"completed" : false,
"appSparkVersion" : "3.5.0-SNAPSHOT",
"startTimeEpoch" : 1678183637754,
"endTimeEpoch" : -1,
"lastUpdatedEpoch" : 1678183637754
  } ]
} ]
```
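
A small way to observe the change from Python while an application is running (a sketch only, assuming the UI is reachable at the default `localhost:4040`):
```python
import json
import urllib.request

# Standard Spark UI REST endpoint; adjust host/port for your application.
with urllib.request.urlopen("http://localhost:4040/api/v1/applications") as resp:
    apps = json.load(resp)

attempt = apps[0]["attempts"][0]
# Previously this was always 0 for a live application; with the fix it grows
# with the application's uptime, in milliseconds.
print(attempt["duration"])
```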

Closes #40313 from yaooqinn/SPARK-42697.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
---
 core/src/main/scala/org/apache/spark/ui/SparkUI.scala | 2 +-
 core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala | 9 -
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/ui/SparkUI.scala 
b/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
index db1f8bc1a2f..ac154b79385 100644
--- a/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
+++ b/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,
 lastUpdated = new Date(startTime),
 sparkUser = getSparkUser,
 completed = false,
diff --git a/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala 
b/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
index 45348b2e9a7..79496bba667 100644
--- a/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
+++ b/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
@@ -700,7 +700,14 @@ class UISeleniumSuite extends SparkFunSuite with 
WebBrowser with Matchers {
   parseDate(attempts(0) \ "startTime") should be (sc.startTime)
   parseDate(attempts(0) \ "endTime") should be (-1)
   val oneAppJsonAst = getJson(sc.ui.get, "")
-  oneAppJsonAst should be (appListJsonAst.children(0))
+  val duration = attempts(0) \ "duration"
+  oneAppJsonAst \\ "duration" should not be duration
+  // SPARK-42697: duration will increase as the app is running
+  // Replace the duration before we compare the full JObjects
+  val durationAdjusted = oneAppJsonAst.transformField {
+case ("duration", _) => ("duration", duration)
+  }
+  durationAdjusted should be (appListJsonAst.children(0))
 }
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] tag v3.4.0-rc3 created (now b9be9ce15a8)

2023-03-08 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a change to tag v3.4.0-rc3
in repository https://gitbox.apache.org/repos/asf/spark.git


  at b9be9ce15a8 (commit)
This tag includes the following new commits:

 new b9be9ce15a8 Preparing Spark release v3.4.0-rc3

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] 01/01: Preparing Spark release v3.4.0-rc3

2023-03-08 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a commit to tag v3.4.0-rc3
in repository https://gitbox.apache.org/repos/asf/spark.git

commit b9be9ce15a82b18cca080ee365d308c0820a29a9
Author: Xinrong Meng 
AuthorDate: Thu Mar 9 05:34:00 2023 +

Preparing Spark release v3.4.0-rc3
---
 R/pkg/DESCRIPTION  | 2 +-
 assembly/pom.xml   | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml  | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml| 2 +-
 common/sketch/pom.xml  | 2 +-
 common/tags/pom.xml| 2 +-
 common/unsafe/pom.xml  | 2 +-
 connector/avro/pom.xml | 2 +-
 connector/connect/client/jvm/pom.xml   | 2 +-
 connector/connect/common/pom.xml   | 2 +-
 connector/connect/server/pom.xml   | 2 +-
 connector/docker-integration-tests/pom.xml | 2 +-
 connector/kafka-0-10-assembly/pom.xml  | 2 +-
 connector/kafka-0-10-sql/pom.xml   | 2 +-
 connector/kafka-0-10-token-provider/pom.xml| 2 +-
 connector/kafka-0-10/pom.xml   | 2 +-
 connector/kinesis-asl-assembly/pom.xml | 2 +-
 connector/kinesis-asl/pom.xml  | 2 +-
 connector/protobuf/pom.xml | 2 +-
 connector/spark-ganglia-lgpl/pom.xml   | 2 +-
 core/pom.xml   | 2 +-
 docs/_config.yml   | 6 +++---
 examples/pom.xml   | 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml   | 2 +-
 launcher/pom.xml   | 2 +-
 mllib-local/pom.xml| 2 +-
 mllib/pom.xml  | 2 +-
 pom.xml| 2 +-
 python/pyspark/version.py  | 2 +-
 repl/pom.xml   | 2 +-
 resource-managers/kubernetes/core/pom.xml  | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/mesos/pom.xml| 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/catalyst/pom.xml   | 2 +-
 sql/core/pom.xml   | 2 +-
 sql/hive-thriftserver/pom.xml  | 2 +-
 sql/hive/pom.xml   | 2 +-
 streaming/pom.xml  | 2 +-
 tools/pom.xml  | 2 +-
 43 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index fa7028630a8..4a32762b34c 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 3.4.1
+Version: 3.4.0
 Title: R Front End for 'Apache Spark'
 Description: Provides an R Front end for 'Apache Spark' 
.
 Authors@R:
diff --git a/assembly/pom.xml b/assembly/pom.xml
index b86fee4bceb..c58da7aa112 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../pom.xml
   
 
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index f9ecfb3d692..95ea15552da 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../../pom.xml
   
 
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index 22ee65b7d25..e4d98471bf9 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../../pom.xml
   
 
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 2c67da81ca4..7a6d5aedf65 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../../pom.xml
   
 
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index 219682e047d..1c421754083 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../../pom.xml
   
 
diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml
index 22ce78c6fd2..2ee25ebfffc 100644

[spark] branch branch-3.4 updated: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new afced91348f [SPARK-42697][WEBUI] Fix /api/v1/applications to return 
total uptime instead of 0 for the duration field
afced91348f is described below

commit afced91348f73adc852caf9ec5f0f7f7e0f543e6
Author: Kent Yao 
AuthorDate: Thu Mar 9 13:33:43 2023 +0800

[SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime 
instead of 0 for the duration field

### What changes were proposed in this pull request?

Fix /api/v1/applications to return total uptime instead of 0 for duration

### Why are the changes needed?

Fix REST API OneApplicationResource

### Does this PR introduce _any_ user-facing change?

yes, /api/v1/applications will return the total uptime instead of 0 for the 
duration

### How was this patch tested?

Built and ran locally; sample response:

```json
[ {
  "id" : "local-1678183638394",
  "name" : "SparkSQL::10.221.102.180",
  "attempts" : [ {
"startTime" : "2023-03-07T10:07:17.754GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2023-03-07T10:07:17.754GMT",
"duration" : 20317,
"sparkUser" : "kentyao",
"completed" : false,
"appSparkVersion" : "3.5.0-SNAPSHOT",
"startTimeEpoch" : 1678183637754,
"endTimeEpoch" : -1,
"lastUpdatedEpoch" : 1678183637754
  } ]
} ]
```

Closes #40313 from yaooqinn/SPARK-42697.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit d3d8fdc2882f5c084897ca9b2af9a063358f3a21)
Signed-off-by: Kent Yao 
---
 core/src/main/scala/org/apache/spark/ui/SparkUI.scala | 2 +-
 core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala | 9 -
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/ui/SparkUI.scala 
b/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
index db1f8bc1a2f..ac154b79385 100644
--- a/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
+++ b/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,
 lastUpdated = new Date(startTime),
 sparkUser = getSparkUser,
 completed = false,
diff --git a/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala 
b/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
index 45348b2e9a7..79496bba667 100644
--- a/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
+++ b/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
@@ -700,7 +700,14 @@ class UISeleniumSuite extends SparkFunSuite with 
WebBrowser with Matchers {
   parseDate(attempts(0) \ "startTime") should be (sc.startTime)
   parseDate(attempts(0) \ "endTime") should be (-1)
   val oneAppJsonAst = getJson(sc.ui.get, "")
-  oneAppJsonAst should be (appListJsonAst.children(0))
+  val duration = attempts(0) \ "duration"
+  oneAppJsonAst \\ "duration" should not be duration
+  // SPARK-42697: duration will increase as the app is running
+  // Replace the duration before we compare the full JObjects
+  val durationAdjusted = oneAppJsonAst.transformField {
+case ("duration", _) => ("duration", duration)
+  }
+  durationAdjusted should be (appListJsonAst.children(0))
 }
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new cfcd453ea27 [SPARK-42697][WEBUI] Fix /api/v1/applications to return 
total uptime instead of 0 for the duration field
cfcd453ea27 is described below

commit cfcd453ea27897424e70aa0c772d334726331a88
Author: Kent Yao 
AuthorDate: Thu Mar 9 13:33:43 2023 +0800

[SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime 
instead of 0 for the duration field

### What changes were proposed in this pull request?

Fix /api/v1/applications to return total uptime instead of 0 for duration

### Why are the changes needed?

Fix REST API OneApplicationResource

### Does this PR introduce _any_ user-facing change?

yes, /api/v1/applications will return the total uptime instead of 0 for the 
duration

### How was this patch tested?

Built and ran locally; sample response:

```json
[ {
  "id" : "local-1678183638394",
  "name" : "SparkSQL::10.221.102.180",
  "attempts" : [ {
"startTime" : "2023-03-07T10:07:17.754GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2023-03-07T10:07:17.754GMT",
"duration" : 20317,
"sparkUser" : "kentyao",
"completed" : false,
"appSparkVersion" : "3.5.0-SNAPSHOT",
"startTimeEpoch" : 1678183637754,
"endTimeEpoch" : -1,
"lastUpdatedEpoch" : 1678183637754
  } ]
} ]
```

Closes #40313 from yaooqinn/SPARK-42697.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit d3d8fdc2882f5c084897ca9b2af9a063358f3a21)
Signed-off-by: Kent Yao 
---
 core/src/main/scala/org/apache/spark/ui/SparkUI.scala | 2 +-
 core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala | 9 -
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/ui/SparkUI.scala 
b/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
index db1f8bc1a2f..ac154b79385 100644
--- a/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
+++ b/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,
 lastUpdated = new Date(startTime),
 sparkUser = getSparkUser,
 completed = false,
diff --git a/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala 
b/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
index 0a751e975ff..27daa08e17a 100644
--- a/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
+++ b/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
@@ -699,7 +699,14 @@ class UISeleniumSuite extends SparkFunSuite with 
WebBrowser with Matchers with B
   parseDate(attempts(0) \ "startTime") should be (sc.startTime)
   parseDate(attempts(0) \ "endTime") should be (-1)
   val oneAppJsonAst = getJson(sc.ui.get, "")
-  oneAppJsonAst should be (appListJsonAst.children(0))
+  val duration = attempts(0) \ "duration"
+  oneAppJsonAst \\ "duration" should not be duration
+  // SPARK-42697: duration will increase as the app is running
+  // Replace the duration before we compare the full JObjects
+  val durationAdjusted = oneAppJsonAst.transformField {
+case ("duration", _) => ("duration", duration)
+  }
+  durationAdjusted should be (appListJsonAst.children(0))
 }
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 794d143f6c3 [SPARK-42697][WEBUI] Fix /api/v1/applications to return 
total uptime instead of 0 for the duration field
794d143f6c3 is described below

commit 794d143f6c3eb3854f737b547ba9408986543047
Author: Kent Yao 
AuthorDate: Thu Mar 9 13:33:43 2023 +0800

[SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime 
instead of 0 for the duration field

### What changes were proposed in this pull request?

Fix /api/v1/applications to return total uptime instead of 0 for duration

### Why are the changes needed?

Fix REST API OneApplicationResource

### Does this PR introduce _any_ user-facing change?

yes, /api/v1/applications will return the total uptime instead of 0 for the 
duration

### How was this patch tested?

Built and ran locally; sample response:

```json
[ {
  "id" : "local-1678183638394",
  "name" : "SparkSQL::10.221.102.180",
  "attempts" : [ {
"startTime" : "2023-03-07T10:07:17.754GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2023-03-07T10:07:17.754GMT",
"duration" : 20317,
"sparkUser" : "kentyao",
"completed" : false,
"appSparkVersion" : "3.5.0-SNAPSHOT",
"startTimeEpoch" : 1678183637754,
"endTimeEpoch" : -1,
"lastUpdatedEpoch" : 1678183637754
  } ]
} ]
```

Closes #40313 from yaooqinn/SPARK-42697.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit d3d8fdc2882f5c084897ca9b2af9a063358f3a21)
Signed-off-by: Kent Yao 
---
 core/src/main/scala/org/apache/spark/ui/SparkUI.scala | 2 +-
 core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala | 9 -
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/ui/SparkUI.scala 
b/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
index b1769a8a9c9..f7afd118fa4 100644
--- a/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
+++ b/core/src/main/scala/org/apache/spark/ui/SparkUI.scala
@@ -127,7 +127,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,
 lastUpdated = new Date(startTime),
 sparkUser = getSparkUser,
 completed = false,
diff --git a/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala 
b/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
index 015f299fc6b..09e9ea1a43a 100644
--- a/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
+++ b/core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala
@@ -698,7 +698,14 @@ class UISeleniumSuite extends SparkFunSuite with 
WebBrowser with Matchers with B
   parseDate(attempts(0) \ "startTime") should be (sc.startTime)
   parseDate(attempts(0) \ "endTime") should be (-1)
   val oneAppJsonAst = getJson(sc.ui.get, "")
-  oneAppJsonAst should be (appListJsonAst.children(0))
+  val duration = attempts(0) \ "duration"
+  oneAppJsonAst \\ "duration" should not be duration
+  // SPARK-42697: duration will increase as the app is running
+  // Replace the duration before we compare the full JObjects
+  val durationAdjusted = oneAppJsonAst.transformField {
+case ("duration", _) => ("duration", duration)
+  }
+  durationAdjusted should be (appListJsonAst.children(0))
 }
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (d3d8fdc2882 -> e4566f4276a)

2023-03-08 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from d3d8fdc2882 [SPARK-42697][WEBUI] Fix /api/v1/applications to return 
total uptime instead of 0 for the duration field
 add e4566f4276a [SPARK-42717][BUILD] Upgrade mysql-connector-java from 
8.0.31 to 8.0.32

No new revisions were added by this update.

Summary of changes:
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-42701][SQL] Add the `try_aes_decrypt()` function

2023-03-08 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7d10330720f [SPARK-42701][SQL] Add the `try_aes_decrypt()` function
7d10330720f is described below

commit 7d10330720f600d7d1aca3ea1ccfcf1f74f41136
Author: Max Gekk 
AuthorDate: Thu Mar 9 09:13:05 2023 +0300

[SPARK-42701][SQL] Add the `try_aes_decrypt()` function

### What changes were proposed in this pull request?
In the PR, I propose to add a new function, `try_aes_decrypt()`, which binds to 
a new expression, `TryAesDecrypt`, that is a runtime-replaceable combination of 
`TryEval` and `AesDecrypt`.

### Why are the changes needed?
The changes improve the user experience with Spark SQL. The existing function 
`aes_decrypt()` fails with an exception as soon as it encounters an invalid input 
that cannot be decrypted, and the rest of the values (even decryptable ones) are 
ignored. The new function returns `NULL` for bad inputs and decrypts the other values.
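
A hedged sketch of that behavior (it assumes a build that already ships `try_aes_decrypt`, i.e. Spark 3.5.0+; the key and the corrupted payload are illustrative):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT
        aes_encrypt('Spark SQL', '0000111122223333') AS good,
        unhex('DEADBEEF')                            AS bad
""").createOrReplaceTempView("payloads")

# aes_decrypt would raise on the corrupted value; try_aes_decrypt returns NULL
# for it while still decrypting the valid one.
spark.sql("""
    SELECT
        cast(try_aes_decrypt(good, '0000111122223333') AS STRING) AS decrypted_good,
        try_aes_decrypt(bad, '0000111122223333')                  AS decrypted_bad
    FROM payloads
""").show()
```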

### Does this PR introduce _any_ user-facing change?
No. This PR just extends existing API.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "sql/test:testOnly 
org.apache.spark.sql.expressions.ExpressionInfoSuite"
$ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
```

Closes #40340 from MaxGekk/try_aes_decrypt.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../spark/sql/catalyst/expressions/misc.scala  | 34 ++
 .../sql-functions/sql-expression-schema.md |  1 +
 3 files changed, 36 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 103e6aae603..ad82a836199 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -452,6 +452,7 @@ object FunctionRegistry {
 expressionBuilder("try_sum", TrySumExpressionBuilder, setAlias = true),
 expression[TryToBinary]("try_to_binary"),
 expressionBuilder("try_to_timestamp", TryToTimestampExpressionBuilder, 
setAlias = true),
+expression[TryAesDecrypt]("try_aes_decrypt"),
 
 // aggregate functions
 expression[HyperLogLogPlusPlus]("approx_count_distinct"),
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
index bf9dd700dfa..300fab0386c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
@@ -432,4 +432,38 @@ case class AesDecrypt(
 copy(newChildren(0), newChildren(1), newChildren(2), newChildren(3))
   }
 }
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr, key[, mode[, padding]]) - This is a special version of 
`aes_decrypt` that performs the same operation, but returns a NULL value 
instead of raising an error if the decryption cannot be performed.",
+  examples = """
+Examples:
+  > SELECT 
_FUNC_(unhex('6E7CA17BBB468D3084B5744BCA729FB7B2B7BCB8E4472847D02670489D95FA97DBBA7D3210'),
 '', 'GCM');
+   Spark SQL
+  > SELECT 
_FUNC_(unhex('--468D3084B5744BCA729FB7B2B7BCB8E4472847D02670489D95FA97DBBA7D3210'),
 '', 'GCM');
+   NULL
+  """,
+  since = "3.5.0",
+  group = "misc_funcs")
+// scalastyle:on line.size.limit
+case class TryAesDecrypt(
+input: Expression,
+key: Expression,
+mode: Expression,
+padding: Expression,
+replacement: Expression) extends RuntimeReplaceable with 
InheritAnalysisRules {
+
+  def this(input: Expression, key: Expression, mode: Expression, padding: 
Expression) =
+this(input, key, mode, padding, TryEval(AesDecrypt(input, key, mode, 
padding)))
+  def this(input: Expression, key: Expression, mode: Expression) =
+this(input, key, mode, Literal("DEFAULT"))
+  def this(input: Expression, key: Expression) =
+this(input, key, Literal("GCM"))
+
+  override def prettyName: String = "try_aes_decrypt"
+
+  override def parameters: Seq[Expression] = Seq(input, key, mode, padding)
+
+  override protected def withNewChildInternal(newChild: Expression): 
Expression =
+this.copy(replacement = newChild)
+}
 // scalastyle:on line.size.limit
diff --git a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md 
b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
index 03ec4bce54b..0894d03f9d4 100644
--- a/sql/core/src/test/resources/sql-fun

[spark] branch master updated: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 07f71d2ba61 [SPARK-42690][CONNECT] Implement CSV/JSON parsing 
functions for Scala client
07f71d2ba61 is described below

commit 07f71d2ba61325331aabbc686ce30cb9012a6643
Author: yangjie01 
AuthorDate: Thu Mar 9 14:59:32 2023 +0800

[SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

### What changes were proposed in this pull request?
This PR adds a new proto message:

```
message Parse {
  // (Required) Input relation to Parse. The input is expected to have 
single text column.
  Relation input = 1;
  // (Required) The expected format of the text.
  ParseFormat format = 2;

  // (Optional) DataType representing the schema. If not set, Spark will 
infer the schema.
  optional DataType schema = 3;

  // Options for the csv/json parser. The map key is case insensitive.
  map options = 4;
  enum ParseFormat {
PARSE_FORMAT_UNSPECIFIED = 0;
PARSE_FORMAT_CSV = 1;
PARSE_FORMAT_JSON = 2;
  }
}
```

and implements CSV/JSON parsing functions for the Scala client.
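
For readers more familiar with Python, a rough classic-PySpark analogue of what the new `json(Dataset[String])` / `csv(Dataset[String])` overloads enable (a sketch only; the actual change is in the Scala client and the Connect planner):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Classic PySpark has long been able to parse an in-memory collection of JSON
# lines; the new proto message gives the Scala Connect client the same ability.
json_lines = spark.sparkContext.parallelize(
    ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}']
)
spark.read.json(json_lines).show()
```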

### Why are the changes needed?
Adds Spark Connect JVM client API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Passed GitHub Actions
- Manually checked with Scala 2.13

Closes #40332 from LuciferYang/SPARK-42690.

Authored-by: yangjie01 
Signed-off-by: Ruifeng Zheng 
---
 .../org/apache/spark/sql/DataFrameReader.scala |  52 +
 .../org/apache/spark/sql/ClientE2ETestSuite.scala  |  64 ++
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  15 ++
 .../CheckConnectJvmClientCompatibility.scala   |   1 -
 .../main/protobuf/spark/connect/relations.proto|  19 ++
 .../explain-results/csv_from_dataset.explain   |   1 +
 .../explain-results/json_from_dataset.explain  |   1 +
 .../query-tests/queries/csv_from_dataset.json  |  38 
 .../query-tests/queries/csv_from_dataset.proto.bin | Bin 0 -> 156 bytes
 .../query-tests/queries/json_from_dataset.json |  38 
 .../queries/json_from_dataset.proto.bin| Bin 0 -> 167 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  |  26 +++
 python/pyspark/sql/connect/proto/relations_pb2.py  | 248 -
 python/pyspark/sql/connect/proto/relations_pb2.pyi |  97 
 14 files changed, 491 insertions(+), 109 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index d5641fb303a..ad921bcc4e3 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -22,8 +22,10 @@ import java.util.Properties
 import scala.collection.JavaConverters._
 
 import org.apache.spark.annotation.Stable
+import org.apache.spark.connect.proto.Parse.ParseFormat
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, 
CharVarcharUtils}
+import org.apache.spark.sql.connect.common.DataTypeProtoConverter
 import org.apache.spark.sql.types.StructType
 
 /**
@@ -324,6 +326,20 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
 format("json").load(paths: _*)
   }
 
+  /**
+   * Loads a `Dataset[String]` storing JSON objects (http://jsonlines.org/";>JSON Lines
+   * text format or newline-delimited JSON) and returns the result as a 
`DataFrame`.
+   *
+   * Unless the schema is specified using `schema` function, this function 
goes through the input
+   * once to determine the input schema.
+   *
+   * @param jsonDataset
+   *   input Dataset with one JSON object per record
+   * @since 3.4.0
+   */
+  def json(jsonDataset: Dataset[String]): DataFrame =
+parse(jsonDataset, ParseFormat.PARSE_FORMAT_JSON)
+
   /**
* Loads a CSV file and returns the result as a `DataFrame`. See the 
documentation on the other
* overloaded `csv()` method for more details.
@@ -351,6 +367,29 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
   @scala.annotation.varargs
   def csv(paths: String*): DataFrame = format("csv").load(paths: _*)
 
+  /**
+   * Loads an `Dataset[String]` storing CSV rows and returns the result as a 
`DataFrame`.
+   *
+   * If the schema is not specified using `schema` function and `inferSchema` 
option is enabled,
+   * this function goes through the input once to determine the input schema.
+   *
+   * If the schema is not specified using `schema` function and `inferSchema` 
option is 

[spark] branch branch-3.4 updated: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 0191a5bde08 [SPARK-42690][CONNECT] Implement CSV/JSON parsing 
functions for Scala client
0191a5bde08 is described below

commit 0191a5bde082c350b0eda07cf93953c497fd273b
Author: yangjie01 
AuthorDate: Thu Mar 9 14:59:32 2023 +0800

[SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

### What changes were proposed in this pull request?
This PR adds a new proto message:

```
message Parse {
  // (Required) Input relation to Parse. The input is expected to have 
single text column.
  Relation input = 1;
  // (Required) The expected format of the text.
  ParseFormat format = 2;

  // (Optional) DataType representing the schema. If not set, Spark will 
infer the schema.
  optional DataType schema = 3;

  // Options for the csv/json parser. The map key is case insensitive.
  map options = 4;
  enum ParseFormat {
PARSE_FORMAT_UNSPECIFIED = 0;
PARSE_FORMAT_CSV = 1;
PARSE_FORMAT_JSON = 2;
  }
}
```

and implements CSV/JSON parsing functions for the Scala client.

### Why are the changes needed?
Adds Spark Connect JVM client API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Passed GitHub Actions
- Manually checked with Scala 2.13

Closes #40332 from LuciferYang/SPARK-42690.

Authored-by: yangjie01 
Signed-off-by: Ruifeng Zheng 
(cherry picked from commit 07f71d2ba61325331aabbc686ce30cb9012a6643)
Signed-off-by: Ruifeng Zheng 
---
 .../org/apache/spark/sql/DataFrameReader.scala |  52 +
 .../org/apache/spark/sql/ClientE2ETestSuite.scala  |  64 ++
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  15 ++
 .../CheckConnectJvmClientCompatibility.scala   |   1 -
 .../main/protobuf/spark/connect/relations.proto|  19 ++
 .../explain-results/csv_from_dataset.explain   |   1 +
 .../explain-results/json_from_dataset.explain  |   1 +
 .../query-tests/queries/csv_from_dataset.json  |  38 
 .../query-tests/queries/csv_from_dataset.proto.bin | Bin 0 -> 156 bytes
 .../query-tests/queries/json_from_dataset.json |  38 
 .../queries/json_from_dataset.proto.bin| Bin 0 -> 167 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  |  26 +++
 python/pyspark/sql/connect/proto/relations_pb2.py  | 248 -
 python/pyspark/sql/connect/proto/relations_pb2.pyi |  97 
 14 files changed, 491 insertions(+), 109 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index d5641fb303a..ad921bcc4e3 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -22,8 +22,10 @@ import java.util.Properties
 import scala.collection.JavaConverters._
 
 import org.apache.spark.annotation.Stable
+import org.apache.spark.connect.proto.Parse.ParseFormat
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, 
CharVarcharUtils}
+import org.apache.spark.sql.connect.common.DataTypeProtoConverter
 import org.apache.spark.sql.types.StructType
 
 /**
@@ -324,6 +326,20 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
 format("json").load(paths: _*)
   }
 
+  /**
+   * Loads a `Dataset[String]` storing JSON objects (http://jsonlines.org/";>JSON Lines
+   * text format or newline-delimited JSON) and returns the result as a 
`DataFrame`.
+   *
+   * Unless the schema is specified using `schema` function, this function 
goes through the input
+   * once to determine the input schema.
+   *
+   * @param jsonDataset
+   *   input Dataset with one JSON object per record
+   * @since 3.4.0
+   */
+  def json(jsonDataset: Dataset[String]): DataFrame =
+parse(jsonDataset, ParseFormat.PARSE_FORMAT_JSON)
+
   /**
* Loads a CSV file and returns the result as a `DataFrame`. See the 
documentation on the other
* overloaded `csv()` method for more details.
@@ -351,6 +367,29 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
   @scala.annotation.varargs
   def csv(paths: String*): DataFrame = format("csv").load(paths: _*)
 
+  /**
+   * Loads an `Dataset[String]` storing CSV rows and returns the result as a 
`DataFrame`.
+   *
+   * If the schema is not specified using `schema` function and `inferSchema` 
option is enabled,
+   * this function goes through the input once to determine

[spark] branch master updated: [SPARK-42689][CORE][SHUFFLE] Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-08 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 30ff2c77043 [SPARK-42689][CORE][SHUFFLE] Allow ShuffleDriverComponent 
to declare if shuffle data is reliably stored
30ff2c77043 is described below

commit 30ff2c77043e53e98468844cefdb6c8c8454c967
Author: Mridul Muralidharan 
AuthorDate: Thu Mar 9 01:09:54 2023 -0600

[SPARK-42689][CORE][SHUFFLE] Allow ShuffleDriverComponent to declare if 
shuffle data is reliably stored

### What changes were proposed in this pull request?

Currently, if there is an executor node loss, we assume the shuffle data on 
that node is also lost. This is not necessarily the case if there is a shuffle 
component managing the shuffle data and reliably maintaining it (for example, 
in distributed filesystem or in a disaggregated shuffle cluster).

### Why are the changes needed?

Downstream projects carry patches to Apache Spark in order to work around 
this issue; for example, Apache Celeborn has 
[this](https://github.com/apache/incubator-celeborn/blob/main/assets/spark-patch/RSS_RDA_spark3.patch).

### Does this PR introduce _any_ user-facing change?

Enhances the `ShuffleDriverComponents` API, but defaults to current 
behavior.
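
A hedged sketch of how such a plugin would be wired in from the user side (the class name is purely hypothetical; the actual API change is the Java `supportsReliableStorage()` default shown in the diff below):
```python
from pyspark.sql import SparkSession

# "com.example.RemoteShuffleDataIO" stands in for a ShuffleDataIO implementation
# whose driver components return supportsReliableStorage() = true.
spark = (
    SparkSession.builder
    .appName("reliable-shuffle-demo")
    .config("spark.shuffle.sort.io.plugin.class", "com.example.RemoteShuffleDataIO")
    # Per the ExecutorAllocationManager change in this diff, dynamic allocation
    # can then be enabled without the external shuffle service.
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)
```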

### How was this patch tested?

Existing unit tests, and added more tests.

Closes #40307 from mridulm/support-hook-for-reliable-shuffle.

Authored-by: Mridul Muralidharan 
Signed-off-by: Mridul Muralidharan gmail.com>
---
 .../spark/shuffle/api/ShuffleDriverComponents.java |  9 
 .../apache/spark/ExecutorAllocationManager.scala   |  5 +-
 .../main/scala/org/apache/spark/SparkContext.scala | 25 +-
 .../org/apache/spark/scheduler/DAGScheduler.scala  |  3 +-
 .../apache/spark/scheduler/TaskSetManager.scala|  1 +
 .../spark/ExecutorAllocationManagerSuite.scala |  3 +-
 .../scala/org/apache/spark/SparkContextSuite.scala |  8 
 .../TestShuffleDataIOWithMockedComponents.scala| 53 ++
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 23 ++
 9 files changed, 115 insertions(+), 15 deletions(-)

diff --git 
a/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDriverComponents.java 
b/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDriverComponents.java
index b4cec17b85b..5c4c1eff9f5 100644
--- 
a/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDriverComponents.java
+++ 
b/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDriverComponents.java
@@ -61,4 +61,13 @@ public interface ShuffleDriverComponents {
* @param blocking Whether this call should block on the deletion of the 
data.
*/
   default void removeShuffle(int shuffleId, boolean blocking) {}
+
+  /**
+   * Does this shuffle component support reliable storage - external to the 
lifecycle of the
+   * executor host ? For example, writing shuffle data to a distributed 
filesystem or
+   * persisting it in a remote shuffle service.
+   */
+  default boolean supportsReliableStorage() {
+return false;
+  }
 }
diff --git 
a/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala 
b/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
index f06312c15cf..187125a66c9 100644
--- a/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
+++ b/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
@@ -103,7 +103,8 @@ private[spark] class ExecutorAllocationManager(
 conf: SparkConf,
 cleaner: Option[ContextCleaner] = None,
 clock: Clock = new SystemClock(),
-resourceProfileManager: ResourceProfileManager)
+resourceProfileManager: ResourceProfileManager,
+reliableShuffleStorage: Boolean)
   extends Logging {
 
   allocationManager =>
@@ -203,7 +204,7 @@ private[spark] class ExecutorAllocationManager(
   throw new SparkException(
 s"s${DYN_ALLOCATION_SUSTAINED_SCHEDULER_BACKLOG_TIMEOUT.key} must be > 
0!")
 }
-if (!conf.get(config.SHUFFLE_SERVICE_ENABLED)) {
+if (!conf.get(config.SHUFFLE_SERVICE_ENABLED) && !reliableShuffleStorage) {
   if (conf.get(config.DYN_ALLOCATION_SHUFFLE_TRACKING_ENABLED)) {
 logInfo("Dynamic allocation is enabled without a shuffle service.")
   } else if (decommissionEnabled &&
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index bb1d0a1c98d..43573894748 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -551,11 +551,6 @@ class SparkContext(config: SparkConf) extends Logging {
   executorEnvs.put("OMP_NUM_THREADS", _conf.get("spark.task.cpus", "1"))
 }
 
-_shuffleDriverComponents = 
ShuffleDataIOUtils.loadShuffleDataIO(config).drive

svn commit: r60498 - /dev/spark/v3.4.0-rc3-bin/

2023-03-08 Thread xinrong
Author: xinrong
Date: Thu Mar  9 07:11:38 2023
New Revision: 60498

Log:
Apache Spark v3.4.0-rc3

Added:
dev/spark/v3.4.0-rc3-bin/
dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz   (with props)
dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz.asc
dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz.sha512
dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz   (with props)
dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz.asc
dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz.sha512
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-hadoop3-scala2.13.tgz   (with 
props)
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-hadoop3-scala2.13.tgz.asc
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-hadoop3-scala2.13.tgz.sha512
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-hadoop3.tgz   (with props)
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-hadoop3.tgz.asc
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-hadoop3.tgz.sha512
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-without-hadoop.tgz   (with props)
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-without-hadoop.tgz.asc
dev/spark/v3.4.0-rc3-bin/spark-3.4.0-bin-without-hadoop.tgz.sha512
dev/spark/v3.4.0-rc3-bin/spark-3.4.0.tgz   (with props)
dev/spark/v3.4.0-rc3-bin/spark-3.4.0.tgz.asc
dev/spark/v3.4.0-rc3-bin/spark-3.4.0.tgz.sha512

Added: dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz.asc
==
--- dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz.asc (added)
+++ dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz.asc Thu Mar  9 07:11:38 2023
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEEzGiz0W/jOnZnBRYLp+V5CMek4bEFAmQJhjwTHHhpbnJvbmdA
+YXBhY2hlLm9yZwAKCRCn5XkIx6ThsRdDEACd98Pk0bSFtKVHER3hjis2R2cg1pgG
+gWiqBZArn1GiB6ck0KHglMklJTFFsw2q9/mro42uVhj0b0hJYcTb2hBO+7vyEYeU
+a+YGhik6FXaQQBL1+oB5aTn2FcnNi7no1Qa+x4opkG7d1giapzQe/oZK1D7RNiYZ
+FAdoDhsUTYCeWDVXbRAcEMca49ltsZDPe45XRHwSgXT45hi6s9oRd78G6v2srbMb
++g7ce4KzAhupZrb5wCnP1MmiWWG1gnfcG0n11LDsiAhYPzzDgW/S4urcqIhWu0+4
+uUSrL6es4mprt1SMybBbmyGrHLuXjdmbBy5XHWy576GoCANdJRffImtmbXFFqp5q
+uau5MDCMFcQwp8pOGjTIDYL4q0p9Kpx3mQ2ykQxWiWg/TgVBQ2leadya8yUV9zZ9
+Y6vuRf9R3iYcXTp3B5XlOWtzjYBICa2XQlizOV3U35xybhSFQHLdUSdBBPMLFsDS
+YxYw1+dm8SjGfHhtsTOsk0ZhgSNgpDC8PBP6UUlz/8qRy4UdjQRrVgkqFmIFcLZs
+CPdX5XlH32PQYtN55qGc6AZECoUpbpigGZetvKqdD5SWyf8maRZZsD+XdR7BT9rk
+LLQTJKak3VQRAn80ONx+JxgzH3B5uV1ldN22vr5nLECpJZDbGjC6etystZDujEYh
+szr47LujCxLTNw==
+=l4pQ
+-END PGP SIGNATURE-

Added: dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz.sha512
==
--- dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz.sha512 (added)
+++ dev/spark/v3.4.0-rc3-bin/SparkR_3.4.0.tar.gz.sha512 Thu Mar  9 07:11:38 2023
@@ -0,0 +1 @@
+4703ffdbf82aaf5b30b6afe680a2b21ca15c957863c3648e7e5f120663506fc9e633727a6b7809f7cff7763a9f6227902f6d83fac7c87d3791234afef147cfc3
  SparkR_3.4.0.tar.gz

Added: dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz.asc
==
--- dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz.asc (added)
+++ dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz.asc Thu Mar  9 07:11:38 2023
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEEzGiz0W/jOnZnBRYLp+V5CMek4bEFAmQJhj4THHhpbnJvbmdA
+YXBhY2hlLm9yZwAKCRCn5XkIx6ThsaMFD/0VbikHk10VpDiRp7RVhquRXR/qHkiK
+ioI02DrZJsZiRElV69Bfxvb1HQSKJhE9xXC+GkS7N+s0neNMXBpYsSxigRICG+Vi
+nPJifZVCNzpckkD5t8t+07X5eTRR7VoRPsHkaYSNKxXiMfXYbOpBOLcP/cvrdPSi
+nXsOnLm3dhxU7kMS+Qy4jbCzQN1fb4XPagxdvPji/aKo6LBw/YiqWHPhHcHlW89h
+cGRAQpN1VjfNkO1zfGxV/h5kD8L/my0zsVMOxtF/r6Qc7FZGBilfMuw8d+8WSVAr
+kRx+s2kB8vuH/undWoRSwpItqv0/gcyFCCvMmLQlbEA0Ku/ldE88XESIuI25uTcC
+tVJFC01Gauh7KlkI4hzsuwlhcDH/geLE1DS59fKC5UMqEYvaKQyQZFzyX0/eFIIS
+8KRZo3B5NUfEXE3fMDOGE8FgJ76QPQ3HO2tB9f+ICeu1/1RioqgucZ7jcKfFIx/J
+FzZ7FkNuLSl3CEnH5BlqdoaCCdmOsZVqcPgaZaGUncgK6ygBWEIEK/I6pE9Sye+Y
+ncBM76ZJf3NsE4Kzdw/v0NCrLaTdIMIK3W3fvVY94IPdk2EY6MuEnGDqG1bn88u4
+zYfP118WS4KtN6fSkczHGf+7+LQIiWrovIb+cQP+TXKeCinRbK1/I6pBWnn4/0u1
+DApXYisgegSYPg==
+=ykwM
+-END PGP SIGNATURE-

Added: dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz.sha512
==
--- dev/spark/v3.4.0-rc3-bin/pyspark-3.4.0.tar.gz.sha512 (added)
+++ dev/spark/v

[spark] branch master updated: [SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1

2023-03-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1d83a76123a [SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1
1d83a76123a is described below

commit 1d83a76123a9e0167a3dd296ea36a72ba3084797
Author: panbingkun 
AuthorDate: Thu Mar 9 15:46:19 2023 +0800

[SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1

### What changes were proposed in this pull request?
This PR aims to upgrade buf from 1.15.0 to 1.15.1.

### Why are the changes needed?
Release Notes: https://github.com/bufbuild/buf/releases
https://github.com/bufbuild/buf/compare/v1.15.0...v1.15.1

Manually tested with dev/connect-gen-protos.sh; this upgrade does not change 
the generated files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually tested and passed GA.

Closes #40348 from panbingkun/SPARK-42724.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml| 2 +-
 python/docs/source/development/contributing.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 7ae04a9bd80..ebc40e92791 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -580,7 +580,7 @@ jobs:
 - name: Install dependencies for Python code generation check
   run: |
 # See more in "Installation" 
https://docs.buf.build/installation#tarball
-curl -LO 
https://github.com/bufbuild/buf/releases/download/v1.15.0/buf-Linux-x86_64.tar.gz
+curl -LO 
https://github.com/bufbuild/buf/releases/download/v1.15.1/buf-Linux-x86_64.tar.gz
 mkdir -p $HOME/buf
 tar -xvzf buf-Linux-x86_64.tar.gz -C $HOME/buf --strip-components 1
 python3.9 -m pip install 'protobuf==3.19.5' 'mypy-protobuf==3.3.0'
diff --git a/python/docs/source/development/contributing.rst 
b/python/docs/source/development/contributing.rst
index 3b12de72546..2d58c86b15e 100644
--- a/python/docs/source/development/contributing.rst
+++ b/python/docs/source/development/contributing.rst
@@ -120,7 +120,7 @@ Prerequisite
 
 PySpark development requires to build Spark that needs a proper JDK installed, 
etc. See `Building Spark 
`_ for more details.
 
-Note that if you intend to contribute to Spark Connect in Python, ``buf`` 
version ``1.15.0`` is required, see `Buf Installation 
`_ for more details.
+Note that if you intend to contribute to Spark Connect in Python, ``buf`` 
version ``1.15.1`` is required, see `Buf Installation 
`_ for more details.
 
 Conda
 ~


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.4 updated: [SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1

2023-03-08 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 74cf1a32b0d [SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1
74cf1a32b0d is described below

commit 74cf1a32b0db91ef5cc334616dff26c01da7e06e
Author: panbingkun 
AuthorDate: Thu Mar 9 15:46:19 2023 +0800

[SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1

### What changes were proposed in this pull request?
This PR aims to upgrade buf from 1.15.0 to 1.15.1.

### Why are the changes needed?
Release Notes: https://github.com/bufbuild/buf/releases
https://github.com/bufbuild/buf/compare/v1.15.0...v1.15.1

Manually tested with dev/connect-gen-protos.sh; this upgrade does not change 
the generated files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually tested and passed GA.

Closes #40348 from panbingkun/SPARK-42724.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
(cherry picked from commit 1d83a76123a9e0167a3dd296ea36a72ba3084797)
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml| 2 +-
 python/docs/source/development/contributing.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 31ae5cfee9b..29a9a58de08 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -581,7 +581,7 @@ jobs:
 - name: Install dependencies for Python code generation check
   run: |
 # See more in "Installation" 
https://docs.buf.build/installation#tarball
-curl -LO 
https://github.com/bufbuild/buf/releases/download/v1.15.0/buf-Linux-x86_64.tar.gz
+curl -LO 
https://github.com/bufbuild/buf/releases/download/v1.15.1/buf-Linux-x86_64.tar.gz
 mkdir -p $HOME/buf
 tar -xvzf buf-Linux-x86_64.tar.gz -C $HOME/buf --strip-components 1
 python3.9 -m pip install 'protobuf==3.19.5' 'mypy-protobuf==3.3.0'
diff --git a/python/docs/source/development/contributing.rst 
b/python/docs/source/development/contributing.rst
index 3b12de72546..2d58c86b15e 100644
--- a/python/docs/source/development/contributing.rst
+++ b/python/docs/source/development/contributing.rst
@@ -120,7 +120,7 @@ Prerequisite
 
 PySpark development requires to build Spark that needs a proper JDK installed, 
etc. See `Building Spark 
`_ for more details.
 
-Note that if you intend to contribute to Spark Connect in Python, ``buf`` 
version ``1.15.0`` is required, see `Buf Installation 
`_ for more details.
+Note that if you intend to contribute to Spark Connect in Python, ``buf`` 
version ``1.15.1`` is required, see `Buf Installation 
`_ for more details.
 
 Conda
 ~


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r60500 - in /dev/spark/v3.4.0-rc3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/deps/ _site/api/R/deps/bootstrap-5.2.2/ _site/api/R/deps/jquery-3.6.0/ _site/api

2023-03-08 Thread xinrong
Author: xinrong
Date: Thu Mar  9 07:54:14 2023
New Revision: 60500

Log:
Apache Spark v3.4.0-rc3 docs


[This commit notification would consist of 2807 parts, 
which exceeds the limit of 50, so it was shortened to a summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org